AD-AU8 493  CARNEGIE-MELLON UNIV PITTSBURGH PA DEPT OF COMPUTER --ETC  F/G 9/2

WORKING PAPERS IN ACQUISITION OF KNOWLEDGE FOR IMAGE UNDERSTAND--ETC(U)
DEC 76  O AKIN, R REDDY, R OHLANDER, M SCHULTZ  F44620-73-C-0074


UNCLASSIFIED 


WORKING  PAPERS  IN  ACQUISITION  OF  KNOWLEDGE  FOR 
IMAGE  UNDERSTANDING  RESEARCH 


Omer  Akin*,  Raj  Reddy,  Ronald  Ohlander 
and  Marty  Schultz 
Department  of  Computer  Science 
Carnegie-Mellon  University 
Pittsburgh,  Pennsylvania  15213 


December  15,  1976 




* Also with the Department of Architecture. This research has been sponsored
in part by the Defense Advanced Projects Agency under contract no. F44620-73-C-0074
and monitored by the Air Force Office of Scientific Research.




CONTENTS 


1.  Knowledge  Acquisition  in  Image  Understanding  Research, 
by  Omer  Akin  and  Raj  Reddy. 

2.  An  Experimental  System  for  Knowledge  Acquisition  in 
Image  Understanding  Research,  by  Ronald  Ohlander,  Raj  Reddy 
and  Omer  Akin. 

3.  An  Interactive  Protocol  Analysis  System  for  Knowledge 
Acquisition,  by  Omer  Akin  and  Marty  Schultz. 

4.  Eye  Fixations  for  Image  Understanding  Research, 
by  Omer  Akin. 


ABSTRACT 


Use  of  knowledge  has  facilitated  complex  problem  solving  in  many  areas  of 
research.  However,  in  the  Image  Understanding  area,  we  do  not  have  any  systematic 
treatment  and  codification  of  knowledge  that  is  useful  in  image  perception.  Further, 
we  do  not  even  have  adequate  tools  for  acquiring  the  necessary  knowledge  base.  In 
this  report  we  present  an  experimental  paradigm  for  knowledge  acquisition,  discuss  an 
analysis technique, and illustrate the different types of knowledge that seem to be
useful  in  image  understanding  research. 

In  the  first  paper,  three  major  aspects  of  knowledge  are  presented:  primitive 
Feature  Extraction  Operators,  Rewriting  Rules,  and  Flow  of  Control.  A  limited  number 
of  Feature  Extraction  Operators  were  repeatedly  used  by  the  subjects  to  specify 
location, size, shape, quantity, color, texture, and patterns of various components found
in scenes. Six types of Rewriting Rules were identified: assertions, negative assertions,
context-free,  conditional,  generative,  and  analytical  inferences.  Flow  of  Control 
exhibited  characteristics  of  an  hypothesize  and  test  paradigm  capable  of  using 
imprecise,  conflicting  hypotheses  in  cooperation  with  others  in  a  multi-dimensional 
problem  space. 

The  second  paper  discusses  the  picture-puzzle  paradigm  and  the  various  ways 
in  which  it  can  be  used  as  a  tool  for  acquisition  of  knowledge.  The  third  paper  deals 
with  a  computer  program  that  assists  the  transcription  of  typical  protocols  obtained 
from  the  picture  puzzle  tasks.  Finally,  the  last  paper  of  the  report  discusses  the  pros 
and  cons  of  using  eye-fixation  data  to  acquire  knowledge  used  in  some  of  the  tasks  of 
the  picture-puzzle  paradigm. 


The  total  effort  represents  an  account  of  the  initial  results  of  a  new 
experimental  paradigm.  We  hope  that  this  will  provide  a  sound  basis  for  understanding 
the  issues  of  knowledge  used  in  visual  perception  and  aid  in  the  modelling  of  "seeing" 
systems. 




Knowledge  Acquisition  for  Image  Understanding  Research 

Omer  Akin*  and  Raj  Reddy 
Department  of  Computer  Science 
Carnegie-Mellon  University 
Pittsburgh,  Pennsylvania  15213 


I.  INTRODUCTION 

Most  researchers  believe  that  knowledge  based  systems  will  permit  significant 
advances  in  the  analysis,  description,  and  interpretation  of  images.  The  fact  that 
knowledge  can  be  used  to  constrain  search  has  been  demonstrated  in  several  other 
areas  such  as  chemistry  and  speech  understanding.  In  the  past,  researchers  have  used 
terms  such  as  "linguistic  approach"  to  indicate  the  desirability  of  using  known 
structural  relationships.  More  recently  the  term  "semantics"  has  been  used  to 
represent  the  knowledge  based  aspects  of  image  processing  research. 

In general, however, knowledge comes in all sizes and shapes and the use of all
such knowledge can be helpful in the analysis and description of images. We call
this process Image Understanding, as distinct from scene analysis and pattern
recognition (Reddy and Newell, 1975). The main contributions of this paper are to
present  an  experimental  paradigm  for  knowledge  acquisition,  discuss  an  analysis 
technique,  and  illustrate  different  types  of  knowledge  that  seem  to  be  useful  in  image 
understanding  research. 

Several  earlier  systems  have  attempted  to  use  problem  specific  knowledge  in 
the  analysis  of  scenes.  Guzman  (1968)  has  used  knowledge  based  on  spatial 
relationships.  Although  his  system  is  capable  of  dealing  with  complex  scenes,  its 
performance  is  limited  to  planar  surfaced  objects.  The  kinds  of  spatial  relations  used 
by Guzman also appear as a part of our results and will later be discussed under
knowledge  about  useful  primitive  features. 

Kelly  (1970)  has  used  a  specialized  knowledge  representation  to  recognize 
pictures  of  people.  He  used  specific,  salient  features  of  different  parts  of  the  human 
body  to  detect  them.  In  this  sense,  he  developed  a  predefined  knowledge  base  for  his 
specific area of application, limited by the special requirements of the recognition task.

Waltz  (1972)  attempts  to  use  knowledge  represented  as  constraints.  In  addition 
to  the  spatial  relations  used  in  identifying  regions,  he  uses  prior  knowledge  about 
shadows  and  occlusions  to  process  complex  scenes. 

Most  of  these  studies  tend  to  deal  with  knowledge  in  a  task-specific,  specialized 
manner.  This  paper  presents  a  general  paradigm  of  research  and  permits 
generalizations  across  domain  dependent  knowledge  in  a  systematic  way. 


*  Also  with  the  Department  of  Architecture. 


Turning our attention to the other related area of research, cognitive
psychology, we find that the research contributions there are not very helpful either.
Research on vision provides an abundance of psychometric information about human
vision (Julesz, 1971; Hochberg, 1964, 1968). Recent studies in cognitive psychology
have accumulated considerable substance about the information processing aspects of
perception (Farley, 1974; Moran, 1973; Baylor, 1971). However, the content and role
of knowledge in visual perception has been almost completely neglected. There are
three major reasons for this, which are directly related to the tradition of research in
experimental psychology.

For one thing, in the standard psychology experiment the knowledge available to
subjects is generally an issue to be controlled rather than investigated. This is largely
why almost all major studies in vision, with the exception of a few (Buswell, 1935;
Shepard, 1976), deal with abstract stimuli with measurable information content.

Secondly, the usual measures used to calibrate the independent variables are
not suitable to measure knowledge. These measures are either based on reaction
times or eye-fixation data (Buswell, 1935; Shepard, 1976; Loftus, 1974). The obvious
logical explanation for this is that the ultimate goal of all of these efforts is to develop
models for perception where the calibration of processing parameters is of utmost
importance. Recently protocol analysis, though potentially useful for investigating
issues of knowledge, has been used towards the same ends.

Finally, no adequate tools of analysis are available to interpret and codify the
data obtained such that issues of knowledge can be dealt with directly. This is partly
due to a lack of interest in codifying knowledge specifically in the area of vision.

On the other hand, the accumulation of the findings of previous research in
psychology has contributed to our present ability to deal with the knowledge
acquisition issue. From the studies on the processes of visual perception we derive
the existence of special mechanisms for inference making and selective processing. In
studies with eye-fixations and studies with simple, abstract stimuli there is
complementary evidence for the availability of special operators for extraction of visual
features. We hope that these studies and the tool proposed here for exploring the
knowledge acquisition issue will be complementary to one another.

The research tool proposed here is similar to Newell's (1968) protocol analysis
method and Woods and Makhoul's (1974) simulations. Our "picture-puzzle" paradigm
consists of providing a man-machine system which simulates a semi-visual-and-semi-verbal
channel that can transmit information to subjects about a given visual scene,
when requested. The protocols we analyzed were obtained from the picture-puzzle
task. The analysis consists of scanning the protocols for the occurrence of different
kinds of knowledge sources used by the subjects.

This basically is very similar to protocol analysis in the usual sense. However, in
this case a detailed description of the problem states and the problem behavior graph
is not necessary for the analysis. We merely inspect the protocols for repeated
patterns of utterances and behavior without a real need to formalize the problem
space. So, this study differs from Woods and Makhoul's simulation in that it attempts
to explore what knowledge sources there are rather than investigating how predefined
sources of knowledge cooperate in a given task environment.

Another study by Firschein and Fischler (1972) comes closer to the picture-puzzle
paradigm than any other. They have collected protocols on subjects verbally
describing scenes, after examining them visually. The control variable they use is the
purpose of the description, which covers things like description of scenes for a general
purpose data-base or for a city-planning data-base. Their analysis focuses on the
content and syntax of the scene descriptions provided by the subjects, while ours is
concerned with the knowledge and mechanisms useful in generating these
transcriptions in the first place.

II.  METHOD 

The  picture-puzzle  paradigm  used  was  human  simulation  of  image  understanding 
under  conditions  similar  to  machine  perception.  Each  subject  was  asked  to  find  out  the 
contents  of  a  color  photograph  of  a  scene  without  visually  seeing  it.  The  subjects 
were  allowed  to  ask  questions  about  the  scene  and  the  experimenter  answered  these 
questions  using  the  actual  photograph  as  reference.  The  questions  were  limited  to  the 
lower  level  attributes  of  the  scene.  By  lower  level  we  mean  information  that  does  not 
specify  object  concepts  but  properties  of  segments  and  regions  such  as  location,  color, 
shape  etc. 

The  picture-puzzle  paradigm  has  three  advantages  for  the  purposes  of  this 
study:  a)  the  phenomenon  of  visual  perception  has  been  removed  from  the  status  of  a 
spontaneous  (uncontrolled)  human  behavior  and  placed  in  the  status  of  problem 
solving,  just  as  in  the  case  of  computer  vision,  b)  the  visual  perception  process  has 
been  slowed  down  by  several  orders  of  magnitude,  c)  the  interaction  of  the  image  and 
the  subject  is  channeled  via  the  experimenter  so  that  it  can  be  recorded  in  the  form  of 
a  protocol. 

A.  EXPERIMENTAL  SET-UP 

Two video terminals used in the experiments were connected at the onset of the
experimental session by means of software (TALKK program) to enable typed-in
communication between their users. The teletypes were located such that no visual
communication between their users was possible (Figure 7). The subject and the
experimenter were able to communicate only verbally thru the TALKK program.

TALKK  was  designed  to  record  all  statements  made  by  both  the  subject  and  the 
experimenter  throughout  an  experiment.  It  also  enabled  them  to  input  conjectures  and 
notes  about  the  task  at  any  time  during  the  experiment.  It  further  enabled  subjects  to 
correlate  their  personal  notes  and  drawings  that  they  were  allowed  to  make  on 
separate  sheets  of  paper,  with  the  protocol.  A  more  detailed  description  of  the  system 
is  given  in  Ohlander,  Reddy  and  Akin  (1976). 

B.  SUBJECTS 

The main objective of this study is to observe the knowledge used by humans in
visual understanding. In order to obtain a "good" sample of instances of such
knowledge we selected subjects that were superior to the average individual in some
relevant respect. Four of the six subjects used were knowledgeable in information
processing and/or visual perception. The remaining two were architects by profession
and had considerable practice in solving complex visual problems. All subjects were
college graduates with graduate education ranging through the Ph.D. level. Hence, the
subjects were a priori assumed to have considerable expertise in visual information
processing.

C. STIMULI AND THE TASK

The  stimuli  used  were  produced  for  and  used  in  automated  image  understanding 
research by Ohlander (1975). All the scenes were constructed or selected as usual
natural  images  of  familiar  objects.  Figures  1  through  6  contain  the  images  used. 

The  subjects  were  simply  instructed  to  understand  the  contents  of  a  stimulus  so 
that  they  would  be  able  to  describe  all  major  objects  in  it  immediately  after  the 
experiment. The experiments were terminated when the subjects thought they
understood all major objects in the scene or at the end of 2 and 1/2 hours, whichever
came first. Only one subject continued with the experiment after the 2-1/2 hours.

The subjects were required to perform the picture-puzzle task with any one of
the six different stimulus scenes. At no time during the experiments were the subjects
allowed to see these photographs. However, they were required to ask questions
about them to the experimenter, who at all times had the photographs available for his
visual examination.

All  verbalizations  from  the  protocols  were  automatically  recorded  by  the 
software  used.  Five  types  of  entries  constituted  a  protocol:  questions,  conjectures  and 
personal  notes  of  subject;  and  answers  and  notes  by  the  experimenter.  At  the  end  of 
each  session  subjects  were  asked  to  recognize  the  stimulus  picture  among  19  other 
pictures  some  of  which  resembled  the  stimulus  in  terms  of  content  and  all  of  which 
resembled  it  in  terms  of  color  and  print  quality. 
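
The five entry types can be pictured as a small record structure. The following Python sketch is only an illustration of that bookkeeping and is not the TALKK or PROTDO software; the names Speaker, EntryKind, Entry, ALLOWED and count_by_type are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Speaker(Enum):
    SUBJECT = "subject"
    EXPERIMENTER = "experimenter"


class EntryKind(Enum):
    QUESTION = "question"
    CONJECTURE = "conjecture"
    NOTE = "note"
    ANSWER = "answer"


# The five entry types named in the text, written as (speaker, kind) pairs.
ALLOWED = {
    (Speaker.SUBJECT, EntryKind.QUESTION),
    (Speaker.SUBJECT, EntryKind.CONJECTURE),
    (Speaker.SUBJECT, EntryKind.NOTE),
    (Speaker.EXPERIMENTER, EntryKind.ANSWER),
    (Speaker.EXPERIMENTER, EntryKind.NOTE),
}


@dataclass(frozen=True)
class Entry:
    """One line of a protocol (hypothetical layout)."""
    speaker: Speaker
    kind: EntryKind
    text: str

    def __post_init__(self):
        # Reject combinations the paradigm does not define,
        # e.g. a question asked by the experimenter.
        if (self.speaker, self.kind) not in ALLOWED:
            raise ValueError(
                f"{self.speaker.value} cannot make a {self.kind.value}"
            )


def count_by_type(protocol):
    """Count the entries of each (speaker, kind) type in a protocol."""
    counts = {}
    for entry in protocol:
        key = (entry.speaker.value, entry.kind.value)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

A protocol is then simply a list of Entry records in the order they were typed, and count_by_type gives a first summary of how a session was spent.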

III.  ANALYSIS  AND  DISCUSSION  OF  PROTOCOLS 

A  priori  we  have  partitioned  the  knowledge  used  in  image  understanding  into 
three  broad  categories.  First  there  is  a  need  for  operators  to  define  the  various 
physical  attributes  (or  features)  in  a  stimulus  of  visual  kind.  Colors,  shapes,  locations, 
orientations  and  textures  of  objects  are  properties  that  can  be  easily  abstracted  and 
they  are  integral  parts  of  the  knowledge  we  use  to  understand  scenes.  We  call  this 
set  of  visual  concepts,  Feature  Extraction  Operators. 

The second category of knowledge relevant to image understanding has to do
with how we translate the visual information captured by the Feature Extraction
Operators into meaningful physical objects or images. Suppose we look down and see
a green textured surface under our feet. How do we know what that surface is?
Given all the additional knowledge about our physical context (fresh air of outdoors, a
supporting surface under our feet, etc.) and the visual impulses we get from the
surface (color, texture, etc.) we infer that we are standing on grass. Hence using the
temporal (outdoor, support, etc.) as well as the general (color, texture of grass)
knowledge we conclude that the surface must be a grassy surface. The knowledge
that enables us to translate individual features (context, color, texture) into object
concepts (i.e., grass) is called Rewriting Rules.

Finally we use the Feature Extraction Operators and the Rewriting Rules
deliberately, or in a nonrandom fashion, to generate the desirable conclusions. For
instance, in the above example the desirable result is to identify that the surface
under our feet is grass. First we extract certain features from the environment, such
as green, textured, horizontal solid surface, etc. Next, based on some of these
features and using the appropriate Rewriting Rules we hypothesize a likely identity for
the surface under our feet. Then we use some or all of the other features to test and
verify, or reject, or modify this hypothesis. This continuous process of hypothesizing
and testing (or variations thereof) constitutes much of how knowledge can be used
during understanding of images. We call the knowledge of activating the appropriate
Feature Extraction Operators and Rewriting Rules to achieve this understanding, the
Flow of Control.

The hypothesize-test paradigm is provided here as an example of a kind of
control flow. It is by no means the only one that should be taken into account. The three
knowledge  classes  outlined  above  are  provided  merely  to  structure  the  problem  area 
of  knowledge  used  in  Image  Understanding  into  manageable  subparts.  They  should  not 
be  seen  as  factors  biasing  our  analytical  findings. 
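
The grass example can be traced as a small hypothesize-and-test loop. The Python sketch below is a minimal illustration under stated assumptions, not a model proposed in this paper: the EXPECTED table and its feature names are invented for the example, and the "modify" step mentioned above is left out, so a hypothesis is either verified or rejected.

```python
# Hypothetical rule table: object concept -> features it is expected to show.
# The entries are invented for illustration only.
EXPECTED = {
    "grass": {"green", "textured", "horizontal", "outdoor"},
    "carpet": {"green", "textured", "horizontal", "indoor"},
    "sky": {"blue", "outdoor"},
}


def hypothesize(cue):
    """Generate: every object concept whose expected features include the cue."""
    return sorted(obj for obj, feats in EXPECTED.items() if cue in feats)


def verify(candidate, observed):
    """Test: the hypothesis survives only if all its expected features were observed."""
    return EXPECTED[candidate] <= observed


def understand(observed, cue):
    """Try each hypothesis raised by the cue and return the first one that survives."""
    for candidate in hypothesize(cue):
        if verify(candidate, observed):
            return candidate
    return None


# Features extracted in the example of the text: a green, textured,
# horizontal surface, met outdoors.
observed = {"green", "textured", "horizontal", "outdoor"}
print(understand(observed, "green"))
```

Here the cue "green" raises both "carpet" and "grass"; the outdoor context rejects the first and verifies the second. A fuller treatment would also let a failed test modify the hypothesis rather than discard it.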

A typical protocol is provided in Table 1, where the subject works with the
stimulus in Figure 1. There are two kinds of evidence in the protocols: one, direct
evidence represented by the subjects' thoughts about their own behavior in the
entries entitled "CONJECTURE" and "DRAW"; two, the indirect evidence where a series
of questions and answers have suggested to us certain behavioral patterns. A
program (PROTDO) was developed to achieve consistency and objectivity in
interpreting protocols (Akin and Schultz, 1976). PROTDO is not a general purpose
analysis program but rather an interactive filing system equipped with special search
and format features tuned to the specific tasks of this investigation. Both categories
of evidence from the protocols shall be discussed with respect to all three sources of
knowledge outlined above: Feature Extraction Operators, Rewriting Rules and Flow of
Control.

A.  PERFORMANCE  OF  SUBJECTS 

Before  going  into  the  details  of  knowledge  acquisition  it  is  useful  to  judge  the 
level  of  understanding  each  subject  achieved  at  the  end  of  the  sessions.  All  subjects 
were asked to describe the picture they were looking at in their own terms, after each
session.  Table  2  contains  the  complete  descriptions  provided  by  the  subjects. 

With the exception of S1, S2, and S5, the subjects were never shown the set of
pictures the stimuli were selected from prior to the experiment. These three subjects
were familiar with the images by virtue of their daily activities in the AI
Laboratories at Carnegie-Mellon University. However, this does not present any
experimental drawback because we are interested in getting at a wide range of
knowledge applicable in Image Understanding rather than explaining the problem solving
behaviors of the subjects. The level of understanding of the subjects with prior
knowledge of the pictures did not differ greatly from that of two of the three remaining
subjects at the end of the sessions anyway. Furthermore, they had to work for it just
as hard. Subject 2, who spent about twice the time on the task, achieved superior
understanding with respect to all subjects.

All  subjects,  except  S3,  had  some  accurate  internal  representation  of  the 
contents  of  the  stimulus  image.  These  accounted  correctly  for  roughly  30  to  70 
percent  of  all  objects  in  the  scene,  depending  on  the  particular  subject  and  the  time 
spent  in  the  session.  The  reasons  for  S3’s  achieving  a  sub-standard  level  of 
understanding  lie  in  the  semantic  misunderstandings  that  filled  up  a  major  portion  of 
the  2  hour  session. 

On the other hand, all 6 subjects had no difficulty in visually identifying the
scene among 19 others once the session was over. Visual recognition occurred in all
instances based on very few general features found or lacking in the scene; S1: "not
enough blue", S1: "no green at bottom." These features are based on low level
information, i.e., color distributions, rather than high level concepts, i.e., car, building,
etc. Consequently even S3, who had no idea what the scene contained, had no trouble
recognizing the scene after the session, based on a few low level features.

B.  KNOWLEDGE  ABOUT  FEATURE  EXTRACTION  OPERATORS 

First we scanned all protocols for physical components of the scene that were
directly named by the subjects. Six levels of description have been referred to by the
subjects: scene, cluster, object, region, (sub-region,) segment. In addition, other spatial
and representational concepts have also been used; i.e., point, plane, space,
2-dimensional, and 3-dimensional (Table 3).

Secondly, all features referring to such components or their relations have been
identified  in  the  protocols.  Seven  such  classes  of  features  have  been  observed: 
location,  size,  shape,  quantity,  color,  texture,  patterns  and  miscellaneous  others.  A 
complete  list  of  the  Feature  Extraction  Operators  is  provided  in  Table  3. 

Sixteen  different  classes  of  Feature  Extractors  were  used  to  indicate  locational 
relations.  These  were  delimiter,  above,  below,  adjacent,  around,  along,  far,  within, 
without, center, corner, left, right, across, vertical and horizontal. Some of those
Operators  were  expressed  in  alternative  wording;  such  as,  separator  for  delimiter, 
higher  for  above,  surrounding  for  around  and  so  on.  Similarly  all  Feature  Extraction 
Operators  discussed  below  represent  classes  with  more  than  one  alternate  term  in 
each  class. 
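
Treating each Operator as a class of interchangeable terms amounts to a small lookup. The Python sketch below is only an illustration of that bookkeeping: the sixteen class names and the three alternates are the ones quoted above, while the function names and the sample question are hypothetical.

```python
import re

# The sixteen locational Feature Extractor classes listed in the text.
LOCATION_OPERATORS = {
    "delimiter", "above", "below", "adjacent", "around", "along", "far",
    "within", "without", "center", "corner", "left", "right", "across",
    "vertical", "horizontal",
}

# Alternate wordings quoted in the text, mapped to their class name.
ALTERNATES = {
    "separator": "delimiter",
    "higher": "above",
    "surrounding": "around",
}


def canonical_operator(term):
    """Return the class a term belongs to, or None if it is not locational."""
    word = term.strip().lower()
    word = ALTERNATES.get(word, word)
    return word if word in LOCATION_OPERATORS else None


def operators_in(utterance):
    """List the locational classes used in an utterance, in order of appearance."""
    words = re.findall(r"[a-z]+", utterance.lower())
    found = (canonical_operator(w) for w in words)
    return [op for op in found if op is not None]


# An invented question, not taken from the protocols.
print(operators_in("Is there a region surrounding the blue area, higher than the center?"))
```

The same table-driven approach would extend to the size, shape, quantity, color and pattern classes, each with its own list of alternates.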

All common geometric shapes (i.e., square, circle, triangle, polygon, trapezoid)
were used as shape Operators. In addition some not so common shapes were also used,
such as, i-junction, bifurcated. Some other shape properties commonly used in the
protocols dealt with linear elements and their combinations such as, angularity,
linearity, curved, flat, convex.




Only  five  classes  of  Feature  Extractors  were  used  to  specify  size  in  the 
protocols.  These  were  large,  small,  long,  short,  and  ratio.  This  was  largely  due  to  the 
fact  that  the  subjects  were  given  information  about  the  metrics  of  the  various 
subparts  of  the  scene,  to  the  nearest  1/16  of  an  inch. 

Many of the quantity Operators were imprecise concepts such as some, most,
more or less, few, extent. However, these did not pose difficulties in the interaction of
the subjects and the experimenter. Other Operators of quantification were more
precise in the sense that they were expressible in numbers or that clear criteria for
evaluating them existed. These were whole, any, quadrant, more.

Color  was  the  Feature  Extraction  Operator  that  was  used  most  frequently  in 
labeling  regions.  Most  common  hues  were  used  extensively  by  the  subjects.  Also  the 
density, contrast and texture of these hues were used to further specialize the coding
schemes  based  on  colors. 

All  of  the  Feature  Extractors  classified  under  patterns  indicate  some  property  of 
the  relationship  between  multiple  elements.  Usually  this  property  deals  with  the  rate 
or  nature  of  change  of  some  feature  between  different  subparts  of  the  scene.  For 
example,  some  of  these  pattern  Operators  are,  (in)homogeneity,  gradual,  abrupt,  same, 
varying,  continuous,  (ir)regular,  random,  mixed,  intersect  and  distribution. 

A  set  of  commonly  used  Feature  Extractors  did  not  seem  to  fit  readily  in  any  of 
the  above  categories.  These  were  categorized  under  miscellaneous  and  consisted  of 
approximate,  relative,  open,  complex,  basic  and  each. 

C.  KNOWLEDGE  ABOUT  REWRITING  RULES 

All  protocols  contain  many  instances  where  information  available  in  one  level  of 
scene  description  (feature,  segment,  region,  object,  cluster  of  objects,  scene)  is  used  to 
generate  information  in  a  different  level.  For  example,  a  feature  such  as  green  in 
color,  or  a  region  trapezoidal  in  shape  can  be  rewritten  as  grass  or  building  in  the 
object  level,  respectively.  These  elements  of  knowledge  used  in  rewriting  information 
available  in  one  level  of  scene  description  into  a  different  level  are  called  Rewriting 
Rules  (or  Hypothesis  Formation  Rules). 

Even  though  the  protocols  contain  many  instances  where  Rewriting  Rules  are 
used, none of these instances contain rules that are explicitly stated. For example,
"Blueband is not a viewer or anything flat"; "Probably sky, if this is an outdoor scene";
"Maybe we have a road" are some direct quotes from different protocols. All of these
represent inferences made about the scene using the kinds of Rewriting Rules we are
after.  The  existence  of  blue  is  used  to  infer  sky  in  the  above  example,  based  on  a 
rewriting  rule  such  as  "skies  are  usually  blue”. 

Needless to say, these Rewriting Rules can only be inferred from the evidence
present in the protocols. The method we devised for identifying the Rewriting Rules
consists of an interactive protocol transcription system. A program (PROTDO) was
written to sort parts of the protocol into some predetermined categories and allow the
analyzer to fill in other predetermined categories manually. The categories used
consist of the information gathered from each answer given by the experimenter and
the inferences made, based on the accumulation of information up to that point in time, as
well as the generation and testing of hypotheses. Often in relating the information
obtained to the inferences made the experimenter had to deduce the appropriate
Rewriting Rules that were possibly used by the subjects in between the two.

Table 5 contains a sample transcription corresponding to the first seven
questions of the protocol in Table 1. The Rewriting Rules are labelled appropriately in
Table 5. Notice the code provided in the parentheses after each rewriting rule. This
code indicates the origin, direction and destination of the inference enabled by that
Rewriting Rule among the six scene description levels. For example, "Feature to Object"
indicates that the rule rewrites information from the feature level into the object level.

Table  4  contains  a  complete  listing  of  all  Rewriting  Rules  observed  in  the 
transcribed  protocols.  Six  other  categories  of  use  are  identified  for  the  Rewriting 
Rules.  Some  Rules  are  used  as  assertions,  stating  the  existence  of  a  descriptor  at  the 
destination level, while others are used as negative assertions refuting the existence of
a descriptor. Context-free rules are used more or less independent of prior information
about the scene, while conditional rules contain a priori conditions that must hold so
that  they  can  be  applicable.  Finally,  generative  rules  are  used  to  hypothesize  and 
analytical  rules  are  used  to  test  these  hypotheses. 

1.  Assertions 

These  are  the  assignments  of  certain  descriptive  terms,  such  as,  red,  big,  car, 
grass,  etc.  to  one  or  more  components  of  the  scene.  Some  examples  are: 

Green  region  is  grass.  (Feature  to  Object) 

A  blue  region  may  possibly  be  the  sky.  (Region  to  Object) 

2.  Negative  Assertions 

These  indicate  that  an  assertion  does  not  hold  for  the  given  components  of  the 
scene  in  question.  At  first  this  sort  of  information  seems  to  be  useless  due  to  the  many 
degrees  of  freedom  there  are  in  identifying  the  component  being  examined.  However, 
negative  assertions  support  hypotheses  just  like  regular  assertions,  by  negation.  For 
example,  the  lack  of  a  certain  feature  may  support  a  hypothesis. 

Sky  and  distant  objects  of  similar  color  do  not  have 
contrast  edges.  (Cluster  of  Objects  to  Feature) 

3.  Context-free  Inference 

Some  inferences  seem  to  depend  on  previous  assertions  and  others  do  not.  The 
latter are called context-free.

Perspective  distorts  shapes.  (Scene  to  Region) 

Grass  has  texture.  (Object  to  Feature) 




4.  Conditional  Inferences 


The  inferences  that  can  be  made  only  in  the  existence  of  certain  assertions  are 
called  conditional  inferences: 

Trapezoidal  surfaces  are  the  faces  of  rectilinear  objects 
if  they  are  in  perspective.  (Region  to  Object) 

A  boundary  if  appropriately  positioned  with  respect 
to  a  road  may  indicate  that  the  road  may  have  multiple 
lanes.  (Scene  to  Cluster  of  Objects) 

5.  Generative  Inferences 

Inferences  which  are  used  to  generate  a  hypothesis  or  an  assertion  are  called 
generative.  In  the  case  of  the  picture  puzzle  paradigm  all  the  information  available  to 
the  subject  consists  of  low  level  scene  descriptors.  Hence  all  hypothesis  building 
based on this information works in a bottom-up fashion. That is, information obtained
at the low levels is used to hypothesize objects at the higher levels of scene
description. For instance:

Low  contrast  edges  belong  to  very  distant  objects.  (Segment  to  Object) 

Longitudinal lines on roads are the divisions indicating multiple
lanes.  (Segment  to  Object  to  Cluster  of  Objects) 


6.  Analytical  Inferences 

Inferences which are used to test an already generated hypothesis or an
assertion are called analytical. Just as generative inferences usually work
bottom-up, the analytical inferences that test the generated hypotheses work in a
top-down fashion. That is, a hypothesis generated about a high level object is usually
tested by verifying the existence of some low level properties of the object.

Eyes,  or  eye-glasses,  may  look  like  two  adjacent  arcs.  (Object 
to  Segment) 

Man-made  objects  contain  repetitive  shapes.  (Object  to  Region) 
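
The rule types above lend themselves to a production-system-like encoding. The following is only a minimal sketch in Python, not the representation used in the study: the class RewritingRule, its field names, and the helper fire are hypothetical, and the example rules are paraphrased from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


# Hypothetical encoding of a Rewriting Rule; all names are illustrative.
@dataclass(frozen=True)
class RewritingRule:
    kind: str                               # e.g. "assertion", "conditional", "generative"
    levels: str                             # e.g. "Feature to Object"
    if_part: str                            # descriptor the rule is triggered by
    then_part: str                          # descriptor the rule rewrites it into
    requires: FrozenSet[str] = frozenset()  # prior assertions (conditional rules only)


RULES = [
    RewritingRule("assertion", "Feature to Object", "green region", "grass"),
    RewritingRule("conditional", "Region to Object", "trapezoidal surface",
                  "face of a rectilinear object", frozenset({"in perspective"})),
    RewritingRule("generative", "Segment to Object", "low contrast edge",
                  "very distant object"),
    RewritingRule("analytical", "Object to Region", "man-made object",
                  "repetitive shapes"),
]


def fire(observed: str, known: Set[str]) -> List[str]:
    """Return the rewrites of `observed` whose required assertions all hold."""
    return [r.then_part for r in RULES
            if r.if_part == observed and r.requires <= known]
```

In this sketch a conditional rule differs from a context-free one only in having a non-empty `requires` set, which mirrors the distinction drawn above.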

D.  KNOWLEDGE  ABOUT  FLOW  OF  CONTROL 


The behavior of all six subjects can be described as resembling the hypothesize
and test paradigm very closely. Whatever his current focus of attention, a subject
forms some hypothesis about a property of one of the scene's contents, such as
"the scene is indoors", "there is a car in the scene", or "the blue region is
hilltops".

Then the subjects rewrite the hypothesis into a testable proposition by
identifying some properties that may prove or refute the hypothesis. For example,
using  the  above  examples  about  the  indoor  scene,  the  car  and  the  hill  a  typical  subject 
would  test  the  following  propositions: 

indoor  scenes  may  contain  large  wall  areas  that  may  be  white  or  off-white 
in  color. 

cars  have  shiny  round  accessories  which  occur  in  several  locations. 

a sloping surface has its texture getting finer as it moves away from
the observer.

These Rewriting Rules can be used for generating testable propositions as
shown above or for generating hypotheses about the scene. Given the answers to
these tests, i.e., that there is a large white surface, or shiny round regions,
or texture getting finer as you move up in a region, the above hypotheses can be
generated respectively.

The  third  operation  subjects  seem  to  apply  is  related  to  the  assessment  of  the 
progress  they  are  making  in  performing  the  picture-puzzle  task.  Occasionally  the 
issues  at  hand  will  be  resolved,  elaborated,  or  abandoned  and  the  set  of  questions 
asked  will  exhibit  a  shift  in  the  attention  of  the  subject.  For  instance,  obtaining  the 
information  that  "there  aren’t  sufficiently  large  white  regions”  in  the  above  example 
(from  the  protocol  in  Table  1)  leads  the  subject  to  revise  his  hypothesis  about  the 
"indoor  scene."  Later,  the  subject  assumes  that  the  scene  is  "outdoors  with  a  sky", 
after  having  refuted  the  "indoors"  hypothesis.  Upon  the  rejection  of  this  hypothesis 
he  revised  the  "outdoor"  hypothesis  to  "outdoor  scene  with  an  occluded  sky". 
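
The cycle described so far (propose a hypothesis, rewrite it into a testable proposition, test it, and shift attention on refutation) can be sketched as a simple loop. This is an illustrative Python sketch under stated assumptions, not the subjects' actual procedure: the TESTS table and the function name are hypothetical, and the proposition attached to "outdoors with a sky" is an assumption made for the example.

```python
# Hypothetical mapping from a hypothesis to one testable low-level proposition.
TESTS = {
    "indoors": "large white or off-white regions",
    "outdoors with a sky": "blue region at the top",        # assumed for illustration
    "outdoors with an occluded sky": "a lot of green",
}


def hypothesize_and_test(candidates, answers):
    """Try hypotheses in order; return (accepted hypothesis or None, trace).

    `answers` maps a low-level proposition to True/False and stands in for
    the experimenter's replies.
    """
    trace = []
    for hypothesis in candidates:                 # hypothesize
        proposition = TESTS[hypothesis]           # rewrite into a testable proposition
        supported = answers.get(proposition, False)   # test
        trace.append((hypothesis, proposition, supported))
        if supported:
            return hypothesis, trace
        # refuted: shift attention to the next candidate
    return None, trace
```

With answers that refute the first two candidates, the loop reproduces the indoor, outdoor, occluded-sky sequence described above.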

The three operations, hypothesize, test, and shift of attention in search, are
iterated throughout all six protocols. Each of these operators appears in different
forms as illustrated by the following examples.

1. Generate Hypothesis

Naming:  After  acquiring  some  information  about  the  scene  the  subjects  seemed 
to use free-association to name these entities as familiar objects. For example, after
discovering a large piece of grey area Subject 1 says "Maybe we have a road." The
discovery of round eye-glass-like objects prompts the assertion "People?" from the
same Subject.

Backtracking to try a different hypothesis: If it was apparent after some
examination that the current hypothesis was not supported by the evidence, the
subjects proposed the opposite of the current hypothesis. For example, if the subject
was testing a hypothesis about the scene being an outdoor scene he would soon test
whether it could be an indoor scene. This sequence is observed in Subject 1.
Similarly, after determining the orientation of a surface the same subject reverses his
hypothesis about the object being a flat object and starts hypothesising about non-flat
objects: "Blueband is not a river or anything else flat. Maybe a hill?"




In cases where a likely hypothesis and its exact opposite were both tested and
neither was supported, subjects proposed less probable but plausible hypotheses. For
example, in the following instance the subject considers a special kind of outdoor
scene after determining that the scene is neither an indoor nor an outdoor (in the
normative sense, i.e., with a sky) scene.

Neither  outdoor  nor  typical  office  scene. 

How  many  regions  are  there  in  the  scene.  .  . 

Is  there  a  lot  of  green  in  the  picture.  .  . 

Aha,  maybe  outdoors  with  blocked  sky. 

Sub-goal generation in the presence of uncertainty: In order to deepen the
inquiry about a region it can be decomposed into smaller components, using uniformity
of at least one property as the criterion of decomposition. Most often the criterion
used to detect uniformity was "color":

Is  this  the  same  color  and  texture  throughout? 

In the region to the left of the biggest circle: what is the color and
are there any areas significantly different in color.
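
Decomposition by uniformity of one property can be sketched as a grouping step. The Python below is a hedged illustration only: the patch representation (a dict with an "id" and property values) and the function name decompose_by are assumptions, and spatial connectivity of the resulting components is deliberately ignored.

```python
from collections import defaultdict


def decompose_by(region, prop="color"):
    """Split a region into sub-components that are uniform in `prop`.

    `region` is a list of patches, each a dict such as
    {"id": 1, "color": "gray"}; this representation is hypothetical.
    """
    groups = defaultdict(list)
    for patch in region:
        groups[patch[prop]].append(patch["id"])
    return dict(groups)
```

Each resulting group would then become a sub-goal for further questioning, for example asking about the texture or shape of the differently colored area.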

2. Test Hypotheses

Exhaustiveness of Testing: When a hypothesis was to be tested the subjects
inquired about all salient visual properties of the hypothesized component,
exhaustively. Most subjects start out identifying a particular region by asking about
its color, shape or location with respect to a known region. After a region has been
identified in this way, it takes on average 3 to 4 more properties before that region
can be successfully incorporated in the total understanding of the scene. Hence a total
of 4 to 7 properties are explored about each region. This is almost exhaustive of all
classes of Feature Extraction Operators used by the subjects.

Salient feature testing: The principle of exhaustiveness is violated under certain
conditions. If the hypothesized property can be adequately represented by one major
salient feature, the presence or absence of that feature can be decisive in accepting
or rejecting that hypothesis. For example, consider the following conjectures made by
the subjects:

trapezoids is looking for perspective line. .

Emptiness was looking for maybe number of areas, indicating number of objects.
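
The contrast between exhaustive and salient-feature testing can be made concrete with a small sketch. The Python below is an illustration under assumptions, not the subjects' procedure: the function check_hypothesis and the property dictionaries are hypothetical.

```python
def check_hypothesis(expected, observed, salient=None):
    """Return True if the observations support the hypothesis.

    expected: property -> value expected for the hypothesized object
    observed: property -> value reported by the experimenter
    salient:  optional single property treated as decisive on its own
    """
    if salient is not None:
        # Salient feature testing: one property decides the outcome.
        return observed.get(salient) == expected[salient]
    # Exhaustive testing: every expected property must be confirmed.
    return all(observed.get(p) == v for p, v in expected.items())
```

For instance, with expected = {"color": "green", "texture": "fine", "location": "bottom"} and an observed location of "top", the exhaustive test fails while a test on the salient property "color" alone succeeds.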

Whole and its Parts: Most entities in the scene are parts of other "things" while
they are made up of smaller parts too. Usually parts and wholes are related to each
other at least along one Feature Extraction dimension. Salient features of entities can
be used to associate spatially unrelated regions as parts of the same object or object
cluster. This can be done in one of two ways:




One,  initially  unrelated  regions  may  be  parts  of  the  same  object.  Consider  the 
following  examples: 

Sudden  thought:  gray  lines  are  border  of  the  gray  bottom  objects? 

Are  the  two  reddish  brown  areas  (separated  by  the 
tannish  white  area)  connected? 


Two, there may be undiscovered regions in the scene which are parts of the
object currently being examined, as shown by the following example:

Maybe  this  is  the  front  of  a  car:  headlights,  etc. 

Let's  try  looking  for  some  circles. 

3. Focus of Attention in Search

The most powerful tool of the subjects seems to be the ability to deal with
incomplete and erroneous hypotheses. Given any arbitrary likelihood of success for a
hypothesis, the subjects can operate either under the assumption that it is, or is not,
true until in fact it is eventually confirmed one way or the other. The degree of
confidence associated with an assumption seems to be irrelevant to the usefulness of
that hypothesis due to its tentative nature. Given a state of information about the
scene, the subjects have the options of pursuing, abandoning, accepting, modifying or
striking the current hypothesis.


Discovering the gist of a scene: The first hypothesis each subject deals with has
to do with the issue of the "gist" of the scene. The first five subjects all ask their
first questions about the context or gist of the scene.

Are  there  colors  in  the  scene? 

Are  there  some  high  contrast  edges  in  the  middle  of  the  scene? 

Does  the  picture  contain  wide  open  space? 

I  assume  the  picture  is  representational? 

Is  the  photo  square  or  rectangular? 


The only exception to this is the sixth subject. He starts by dividing up the
scene into quadrants and then asks about the general line, texture and intensity
content of each quadrant. This is the only truly bottom-up approach we observed in
our experiments.

Use of salient features in the selection of next issue: The selection of the next
issue or hypothesis to inquire about is another vital aspect of Flow of Control. Usually
the next issue selected is based on a dominant descriptor. Subjects inquire about the
largest region or the one with the most contrast edge first. The following examples
illustrate the use of size and contrast in determining the significance of a given
region's property.

Any  red  in  the  picture? 

Experimenter:  Yes,  two  reddish  areas  of  very  small  size. 

Forget  it.  (small  size  is  not  important) 

Describe the location and approximate shape of largest homogeneous region.

Are  there  some  high  contrast  edges  in  the  middle  of  the  scene? 
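
Selecting the next issue by a dominant descriptor can be sketched as a ranking over candidate regions. The Python below is a hedged sketch: the region dictionaries, the function name next_issue, and the min_size cutoff (standing in for "small size is not important") are all assumptions made for illustration.

```python
def next_issue(regions, min_size=0.05):
    """Pick the region to inquire about next.

    Regions smaller than `min_size` (fraction of the image; an assumed
    threshold) are ignored. Among the rest, the largest wins, with edge
    contrast as the tie-breaker.
    """
    candidates = [r for r in regions if r["size"] >= min_size]
    if not candidates:
        return None
    best = max(candidates, key=lambda r: (r["size"], r["contrast"]))
    return best["name"]
```

Ranking by size first and contrast second is one possible reading of the protocols; a weighted combination of the two would be an equally plausible choice.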


On  the  other  hand,  if  an  altogether  new  issue  is  necessary,  a  dominant  property 
based  on  the  current  gist  of  the  scene,  such  as  locational  adjacency,  etc.,  can  be  used 
to  explore  new  regions.  All  three  examples  below  illustrate  the  use  of  adjacency  in 
selecting  the  next  region  for  examination. 

Concentrate  next  on  the  lower  edge  of  the  sky. 

I'll try working from the border inward.

Work  by  process  of  elimination  from  the  edges. 

When there was no guidance from the gist assumptions, the subjects went back
to unresolved issues. For example, Subject 1 says "Look at the supposed road," after
dealing with another issue for a while; similarly Subject 2 notes "I don't feel any need
to continue at this point in the lower region of the scene. . . .back to get more detail
on blueband and green region."

Similarly, when a new piece of evidence emerges from an inquiry and it cannot
be simply accommodated by the current assumptions about the scene, this issue can be
set aside for future exploration. Subject 2 notes "Interesting, but come back to this
later."

Resolution of hypothesis: Normally every hypothesis gets resolved after a number of
repeated inquiries about it. Sometimes it is necessary to come back to an issue after
other issues have been resolved. This is inevitable due to the conditionality of
Rewriting Rules in general. For example, a blue region in a "sea" scene is likely to be
the sea and/or the sky, while in a "portrait" scene it is likely to be a piece of garment
or a background surface. These issues can be resolved only after determining the
context of the scene and the relative location of these regions with respect to others.

On  the  other  hand  some  hypotheses  can  not  be  resolved  by  using  the  most 
probable  associations.  In  such  cases  some  rarely  used  Rewriting  Rules  are  used  to 
justify  some  of  the  findings  in  spite  of  some  apparent  contradictions.  Consider  the 
following  examples: 




I’m  beginning  to  think  this  whole  thing  is  a  Kandinsky  painting. 

Almost  certainly  a  city  scene.  Still  puzzled  by  the  low  contrast 
bottom  edge  of  blueband,  the  green  region  within  blueband,  and 
identity  of  blueband.  Perhaps  it's  sky  also  and  its  different 
color is because of pollution or sunset.

Don't  know  what  they  are?  Blue  clouds  or  clouds  which  look 
blue  because  of  lighting. 

IV.  CONCLUSIONS 

This  study  attempts  to  specify  knowledge  used  in  an  image  understanding  task 
by  human  subjects.  In  order  to  achieve  this,  a  picture-puzzle  paradigm  has  been 
developed. 

The first type of evidence provided by the protocols is the knowledge about
possible primitive Feature Extraction Operators. The range of operators seems to be
modest, yet we suspect this was caused by intrinsic properties of the picture-puzzle
paradigm. Translation of visual information into the verbal domain may have taken
away from the richness of the visual information. Yet we believe that the operators
represent a desirable subset in any system for computer understanding of scenes.

The second type of evidence, i.e., knowledge about the Rewriting Rules found in
the protocols, seems natural and appears to be easily implementable through
production system-like schemes. While this set of rules is not intended to be complete
and exhaustive, it provides a good beginning for analysis.

The Flow of Control found in the protocols reveals some general techniques that
already appear to be useful in "blackboard" model-like schemes (Erman and Lesser,
1976). Further experience from our laboratory indicates that different tasks used with
the picture-puzzle paradigm require different Flow of Control mechanisms. We
recommend special attention to task properties in all studies so that an optimal
task-specific control strategy might be utilized.

V.  ACKNOWLEDGEMENTS 

We wish to thank Ron Ohlander and Keith Price for the many hours of work
spent on the experimental tool and Herbert Simon, John Gaschnig, Adrian Baer, Bruce
Lowerre, Randy Korman, and John Render for the valuable protocols they provided.




REFERENCES 


1. O. Akin and M. Schultz, An Interactive Protocol Analysis System for Knowledge
   Acquisition in Image Understanding, Computer Science Department Reports,
   Carnegie-Mellon University, 1976.

2. G. W. Baylor, A Treatise on the Mind's Eye: an Empirical Investigation of Visual
   Mental Imagery, Ph.D. Thesis, Carnegie-Mellon University, 1971.

3. G. T. Buswell, How People Look at Pictures, University Press, Chicago, 1935.

4. L. D. Erman and V. R. Lesser, A multi-level organization for problem solving using
   many, diverse, cooperating sources of knowledge, Working Papers in Speech
   Recognition IV, Carnegie-Mellon University, February 1976.

5. A. M. Farley, VIPS: A Visual Imagery and Perception System: the Results of a
   Protocol Analysis, 2 volumes, Ph.D. Thesis, Carnegie-Mellon University, 1974.

6. O. Firschein and M. A. Fischler, A study in descriptive representation of pictorial
   data, Pattern Recognition, 4, 1972.

7. A. Guzman, Computer Recognition of Three-Dimensional Objects in a visual scene,
   MAC-TR-59, Ph.D. Thesis, MIT Project MAC, 1968.

8. J. Hochberg, Perception, Prentice Hall, New Jersey, 1964.

9. J. Hochberg, In the mind's eye, Contemporary Theory and Research in Visual
   Perception (R. N. Haber, Ed.), Holt, New York, 1968.

10. B. Julesz, Foundations of Cyclopean Perception, University Press, Chicago, 1971.

11. M. D. Kelly, Visual Identification of People by Computer, AIM-130, Ph.D. Thesis,
    Stanford University, 1970.

12. G. R. Loftus, A framework for a theory of picture recognition, presented at the
    National Academy of Sciences Specialist's Meeting on Eye Movements and
    Psychological Processes, Princeton, New Jersey, April 1974.

13. T. P. Moran, The symbolic nature of visual imagery, Proceedings of Third
    International Joint Conference on Artificial Intelligence, Stanford, California,
    1973.

14. A. Newell, On the analysis of human problem solving protocols, Calcul et
    Formalisation dans les Sciences de L'Homme, ed. by J. C. Gardin and B. Jaulin,
    146-185, Centre National de la Recherche Scientifique, Paris, 1968.

15. R. B. Ohlander, Analysis of Natural Scenes, Ph.D. Thesis, Carnegie-Mellon
    University, April 1975.

16. R. B. Ohlander, R. Reddy and O. Akin, An Experimental System for Knowledge
    Acquisition in Image Understanding, Computer Science Department Reports,
    Carnegie-Mellon University, 1976.

17. R. Reddy and A. Newell, Image Understanding: potential research approaches,
    Unpublished report, Computer Science, Carnegie-Mellon University, 1975.

18. R. N. Shepard, Recognition memory for words, sentences and pictures, Journal of
    Verbal Learning and Verbal Behavior, 6, 1976.

19. D. L. Waltz, Generating Semantic Descriptions from Drawing of Scenes with
    Shadows, AI TR-271, MIT-AI Laboratory, November 1972.

20. W. A. Woods and J. Makhoul, Mechanical inference problems in continuous speech
    understanding, Artificial Intelligence 5, 1974.




TABLE 1. Sample Protocol from Task 1.


TABLE  2.  SUBJECTS’  UNDERSTANDING  OF  IMAGES. 


S1: Car Scene (2:30)*

It is outdoors. There appears to be a road with a car. [Can you identify other things
or objects other than the ones you mentioned?] There are other objects: I guess they
are man-made. If they are on the car, they are headlights or other mechanical parts.
There is a lot of grass and some ground.

S2: Downtown Scene (4:35)

There is a blue sky at the top of the picture, bordering the ridge of a hill, beyond
which on the right one can see the ridges of two more distant hills. The main hill is
green-gray-bluish colored, turning to less blue and more gray, green and brownish-
red as one goes down the hill. On the hill is a more greenish region within which are 5
or 6 thin horizontal short strips and some light-colored spots. I conjecture that they
are buildings and other artifacts, might be a housing development. In the foreground
is a city scape, probably downtown with many tall buildings. The buildings occupy
more than 60 percent of this lower portion of the scene, much of the rest of the lower
(foreground) consists of a pond of water on the left near the largest of the buildings
and another pond in the center bottom [wrong] of the scene, and a strip of green,
probably just grass extending left to the center of the scene in the lower portion of
the city scape.

S3: Office Scene (2:35)

[No high-level concepts formed.]

S4: House Scene (2:45)

The horizontal streak with white lines is a street or road. The road passes in front of
some buildings -- or billboards. I think it is a landscape with blue sky above and
mainly green grass and shrubs below. There is a bush of some size in the lower left
and a tree in the foreground to the right of center. A street or road across the scene
horizontally. I have no good hypotheses about the nature of the small rectangles
(since they are flat, not solid). The vertical objects with rounded tops could be silos. I
have no idea what the wedge-shaped objects are.

S5: Bear Scene (2:15)

Facts I know for sure: 1) There is a very large dark area in the center of the picture.
2) This area is roughly pear (or bell) shaped and seems to have a bit of the area
extending lower than the rest. 3) The background contains many brown, gray, and tan
areas that are confusing. 4) There are some lighter areas within the central dark
region: two white areas, one in the center and the other near the right top. There is
also a cluster of smaller patches in the central lower portion of the dark area.

Things I think I know: 1) It is a picture of a bear sitting upright with his right hind leg
folded. 2) I believe he is facing to his left (the white area may be his nose), but I'm
not sure. 3) The confusing background is rocks (from a zoo).

S6: Portrait (2:50)

The image appears to be that of a man sitting or standing erect and facing frontally.
Behind him is an undifferentiated field of gray/white. The man seems to have much
hair including a beard, and may be wearing eye-glasses. Another possibility is that it
is a woman with her hair partially draped over her face or possibly it is a picture of
an ape. But small arcs suggest otherwise. Finally, large arch is suggestive of some
clothing artifact or subject is possibly holding some object in front.

*Total time (in hours) the subject was allowed to work on the session.




TABLE  3.  FEATURE  EXTRACTION  OPERATORS 


TABLE  4.  REWRITING  RULES 




An  Experimental  System  for  Knowledge  Acquisition  in 
Image  Understanding  Research 


R. Ohlander, R. Reddy and O. Akin*
Department  of  Computer  Science 
Carnegie-Mellon  University 
Pittsburgh,  Pennsylvania  15213 


A  class  of  problems  related  to  model  building  in  cognitive  psychology  and 
artificial intelligence requires the codification of knowledge sources. Some cognitive
tasks  of  interest  which  fall  in  this  category  are  perception  of  speech,  perception  of 
visual  scenes,  or  perception  of  other  symbolic  media  such  as  maps,  drawings,  and 
written  text. 

Various  experimental  tools  that  have  been  developed  until  now  can  be 
categorized  into  three  classes:  eye  fixation  studies,  protocol  analysis  of  mental  imagery 
related  tasks,  and  protocol  analysis  with  controlled  exposure  of  stimuli. 

Eye  fixation  studies  have  yielded  specific  information  on  the  feature  selection 
processes  in  perception.  (Buswell,  1935;  Loftus,  1974;  Mackworth  and  Morandi,  1976) 
However,  due  to  a  lack  of  theoretical  models  of  the  image  understanding  process,  these 
studies  have  not  led  to  a  codification  of  knowledge  sources  used  in  picture  processing. 

Protocol  analysis  studies  of  tasks  with  imagined  visual  objects  provide 
theoretical  models  of  image  processing.  (Baylor,  1971;  Moran,  1973)  However,  like 
protocols  based  on  limited  exposure  these  do  not  provide  direct  evidence  on  the 
knowledge  sources  used  in  these  tasks.  (Farley,  1974;  Potter  and  Levy,  1969) 
Basically,  this  is  true  because  these  tasks  were  not  designed  to  explore  sources  of 
knowledge. 

In  this  paper,  we  propose  an  experimental  paradigm  which  is  designed  to 
explore  the  knowledge  sources  used  in  visual  understanding  tasks.  Our  a  priori 
taxonomy  for  knowledge  needed  in  image  understanding  is  made  up  of  three  parts: 
Feature  Extraction  Operators,  Rewriting  Rules,  and  Elements  of  Control  Flow.  Feature 
Extraction  Operations  are  based  on  visual  properties  found  in  scenes,  such  as  color, 
shape,  location,  size,  quantity,  texture,  etc.  that  can  be  used  to  decompose  a  scene  into 
sub-parts  and  then  label  and  characterize  these  sub-parts. 

Rewriting  Rules  enable  the  translation  of  these  low-level  attributes  into 
meaningful  visual  components  --  grass,  chair,  table,  room,  etc.  --  and  vice  versa. 
These  components  can  be  expressed  as  elements  of  various,  hierarchical  levels  of 
scene  description.  For  example,  the  color  "green"  when  supplied  as  a  low-level 
information,  may  help  to  infer  a  "leaf"  at  a  higher  level,  or  a  "forest"  at  a  yet  higher 
level.  The  flow  of  control  governs  the  use  of  Feature  Extraction  Operators  and 
Rewriting Rules in the context of a specific goal-directed visual task. Elements of
Control Flow are helpful to develop alternative scene descriptions and/or test such
descriptions in order to generate a final, unique description of the scene.

* Also in the Department of Architecture.

The  "picture-puzzle"  paradigm  we  developed  aims  to  provide  direct  evidence  for 
all three classes of knowledge cited above. Further, it provides a simulation of the
process of machine understanding of visual scenes. The task of the subjects is to
describe  the  scene  including  the  parts  of  the  visual  scene,  based  solely  on  verbal 
question-answer  interactions  with  the  experimenter.  The  experimenter  can  answer 
questions  concerning  lower  levels  of  scene  description,  only.  For  example,  he  is 
allowed  to  say  that  there  is  a  "green  region"  with  certain  texture,  size,  location,  shape, 
etc.  However,  he  is  not  allowed  to  say  that  there  is  "grass"  in  the  scene. 

In  conventional  experimental  conditions  where  subjects  interact  directly  with  a 
visual  scene  or  image,  the  inferences  made  during  the  analysis  of  the  data  are  either 
based on unobtrusive recordings of subjects' behavior (eye fixations, reaction times) or
the  introspections  of  subjects  about  their  own  behavior  during  the  task  (protocols). 
Figure  1  represents  the  flow  of  information  in  the  conventional  case.  Eye  fixation  and 
reaction-time  information  provides  very  little  in  terms  of  knowledge  used.  A  major 
problem  with  protocols  of  self-assessment  is  the  loss  of  much  of  what  is  internally 
processed. 

Ideally,  the  experimenter  needs  to  have  first  hand  experience  in  monitoring  or 
observing  the  interactions  of  the  subjects  with  the  stimulus.  The  picture-puzzle 
paradigm  achieves  the  monitoring  of  the  interaction  adequately.  Figure  2  indicates  the 
schematic  interaction  between  the  subject,  stimulus  and  the  experimenter.  All 
interactions  between  the  subject  and  the  stimulus  go  through  the  experimenter  in  the 
case  of  the  picture-puzzle  paradigm. 

I.  APPLICATIONS 

Two  video-terminals  were  used  in  the  experiments.  The  terminals  were 
connected  to  each  other  by  means  of  software  (TALKK  program)  to  enable  typed-in 
communication between their users. The facilities of the Computer Science
Department at Carnegie-Mellon University were used to accommodate this set-up. The
terminals were located such that no visual communication between their users was
possible (Figure 7 of the first paper in this volume, entitled "Knowledge Acquisition").
The subject and the experimenter were able to communicate only verbally through the
TALKK program.

TALKK  was  designed  to  record  all  statements  made  by  both  the  subjects  and  the 
experimenter  throughout  the  experiments.  It  also  enabled  them  to  input  conjectures 
and  notes  about  the  task  at  any  time  during  the  experiment.  It  further  enabled 
subjects  to  correlate  their  personal  notes  and  drawings,  which  they  were  allowed  to 
make  on  separate  sheets  of  paper,  with  the  typed  in  protocol  recorded  by  the  TALKK 
program. 

The stimuli used were produced for and used in automated image understanding
research by Ohlander (1975). All the scenes were constructed or selected as usual
natural images of very familiar objects. All the stimuli in the first six figures of the
first paper in this volume have been used in this experiment.




The  subjects  were  simply  instructed  to  "understand"  the  contents  of  the  stimulus 
so  that  they  would  be  able  to  describe  all  major  objects  in  the  scene.  The 
experiments  were  terminated  when  the  subjects  thought  they  understood  all  major 
objects or at the end of 2 and 1/2 hours, whichever came first. The subjects were
required  to  perform  the  experimental  task  with  any  one  of  the  six  different  stimulus 
scenes. 

Due  to  the  fact  that  the  experimental  paradigm  used  here  is  totally  novel,  at 
least  to  our  knowledge,  it  deserves  a  careful  reconstruction  of  its  proceedings  for 
clarity.  We  suggest  that  the  reader  go  over  the  sample  protocol  (in  Table  1  of  the 
first paper in this volume, entitled "Knowledge Acquisition") in which the subject tries
to  "understand"  the  given  image  (in  Figure  1  of  the  same  paper).  Note  that  the  "DREW 
*"s  in  the  protocol  refer  to  the  personal  notes  of  the  subject  indicated  by  numbers  in 
Figure  5. 

Since it was one of the independent variables being examined, the range of
operators used in inquiries by the subjects was not limited. However, when the
subjects used high level descriptors (which were defined as illegal questions at the
onset of the experiments) to inquire about the scene, the experimenter refused to
understand the question. This forced the subjects to reformulate their questions
using low level descriptors only. Subjects were urged throughout the
experiments to put down their conjectures about the task.

All verbalizations from the protocols were automatically recorded by the
software used. Five types of entries constituted a protocol: questions, conjectures and
personal notes of the subject; and answers and notes by the experimenter.
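The five entry types can be pictured as a small tagged record. The Python sketch below is illustrative only: the names EntryType, Entry and by_speaker are assumptions introduced here, the sample entries are invented, and nothing in it describes the actual file format written by TALKK.

```python
from dataclasses import dataclass
from enum import Enum


class EntryType(Enum):
    # Entries made by the subject
    QUESTION = "question"
    CONJECTURE = "conjecture"
    SUBJECT_NOTE = "subject note"
    # Entries made by the experimenter
    ANSWER = "answer"
    EXPERIMENTER_NOTE = "experimenter note"


SUBJECT_TYPES = {EntryType.QUESTION, EntryType.CONJECTURE, EntryType.SUBJECT_NOTE}


@dataclass
class Entry:
    kind: EntryType
    text: str


def by_speaker(protocol):
    """Split a protocol (a list of Entry) into subject and experimenter entries."""
    subject = [e for e in protocol if e.kind in SUBJECT_TYPES]
    experimenter = [e for e in protocol if e.kind not in SUBJECT_TYPES]
    return subject, experimenter


# Hypothetical entries, invented for illustration (not taken from Table 1).
protocol = [
    Entry(EntryType.QUESTION, "What color is the top region?"),
    Entry(EntryType.ANSWER, "Gray."),
    Entry(EntryType.CONJECTURE, "The top region may be an overcast sky."),
]
subject, experimenter = by_speaker(protocol)
```

Keeping the speaker implicit in the entry type, rather than as a separate field, mirrors the way the text groups the five types into subject entries and experimenter entries.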

II.  RESULTS  AND  DISCUSSION 

The protocols have provided direct and indirect accounts of the three kinds of
knowledge sought at the onset. (Akin and Reddy, 1977) Aside from these, three things
have been accomplished by the experimental method used. The first is a
breadth-first exploration of the problem space. Unlike other studies (eye fixation
studies, specific task environments with simple visual stimuli), a broad base of issues
of visual processing is tackled simultaneously. This enables the acquisition of a
general view of a large problem space and the cross-fertilization of knowledge
about all major issues being explored.

Secondly,  the  very  fast  process  of  visual  perception  is  slowed  significantly 
enabling  the  subjects  to  generate  richer  data.  The  paradigm  developed  here  is 
intended  to  aid  model  building  in  artificial  intelligence  more  so  than  exploring  the 
issues  of  cognitive  psychology.  Therefore,  the  fact  that  it  places  the  natural  process 
of  visual  understanding  into  a  form  of  problem  solving  does  not  present  a  problem. 
Finally,  the  slowed  down  process  of  unraveling  the  scene  is  channeled  through  the 
experimenter,  enabling  a  rich  amount  of  data  to  be  recorded. 

In  addition  to  the  general  scene  understanding  task  reported  above,  various 
other  tasks  have  been  tried  using  the  same  experimental  paradigm:  finding  a  landmark 
(target)  in  a  scene;  navigating  the  experimenter  on  a  path  in  a  scene;  and  detection  of 
change  between  two  scenes  with  similar  contents. 




Finding a landmark (target) in a scene: The subjects are briefed on a map
(Figure 3) of the area contained in the scene. They are told what the scene contains
and are required to locate and identify a specific target in it. (Table 1) The same kinds
of Feature Extraction Operators and Rewriting Rules have been observed in this task
as in the original picture-puzzle task. However, the Flow of Control reflects unique
patterns. Special knowledge sources are used for translating two different
representations of the same scene (the photograph and the map) into one another.

Navigating the experimenter on a path: The subjects are briefed about what the
scene contains and are required to find a path for navigation around an obstacle. The
scene used was a suburban house scene and the obstacle was the house itself. (Figure
4 of the first paper in this volume, entitled "Knowledge Acquisition") Here special
knowledge sources for translating the functional requirements of navigation into spatial
terms are used in the protocols. (Table 2)

Detection of change in two scenes with similar content: This experiment aims to
simplify the original task by eliminating detailed examination of the scene altogether.
Instead of requiring subjects to determine the nature and the contents of a scene, the
task requires subjects to match two photographs with slightly different contents. For
example, the subject is told that there are two photographs: one representing a
central business district of a large city (Figure 2 of the first paper in this volume,
entitled "Knowledge Acquisition") and the other representing an urban industrial sector
of the same city (Figure 4). The task was to identify each photograph based on this
distinction. (Table 3) This task enabled the exploration of only a subset of the original
task, i.e., discovering the nature of the scene, independent of a detailed exploration of
the scene's contents.

The experimental paradigm explored here provides a new means of exploring the
knowledge acquisition process in image understanding tasks. We have cited some
variations of the paradigm above. These examples, however, do not exhaust its
possible uses.


III.  ACKNOWLEDGEMENT 

We  thank  Keith  Price  for  the  valuable  help  he  provided  by  programming  TALKK 
that  enabled  the  automatic  recording  of  the  protocols  in  the  picture-puzzle  experiment. 




REFERENCES 


1. Akin, O. and Reddy, R. Knowledge acquisition for image understanding
research. Journal of Computer Graphics and Image Processing, 1977 (in print).

2. Baylor, G. W. A Treatise on the Mind's Eye: an Empirical Investigation of
Visual Mental Imagery. Ph.D. Thesis, Carnegie-Mellon University, 1971.

3. Buswell, G. T. How People Look at Pictures, University of Chicago Press,
Chicago, 1935.

4. Farley, A. M. VIPS: A Visual Imagery and Perception System: the Results of
a Protocol Analysis, 2 vols. Ph.D. Thesis, Carnegie-Mellon University, 1974.

5. Loftus, G. R. A framework for a theory of picture recognition, presented at
the National Academy of Sciences Specialist's Meeting on Eye Movements and
Psychological Processes, Princeton, N. J., April, 1974.

6. Mackworth, N. H. and Morandi, A. J. The gaze selects informative details
within pictures, Perception and Psychophysics, 1967, 2, 547-551.

7. Moran, T. P. The symbolic nature of visual imagery, Proceedings of the
Third International Joint Conference on Artificial Intelligence, Stanford, California,
1973.

8. Ohlander, R. B. Analysis of Natural Scenes, Ph.D. Thesis, Carnegie-Mellon
University, April, 1975.

9. Potter, M. C. and Levy, E. I. Recognition memory for a rapid sequence of
pictures, Journal of Experimental Psychology, 1969, 81, 10-15.




TABLE  1  Sample  Protocol  from  Task  2. 



TABLE  3  Sample  Protocol  from  Task  4. 


FIGURE 1. Information Flow in Visual Image Understanding.
(Diagram elements: SUBJECT, visual perceptor, STIMULUS.)

(Second diagram, caption illegible. Diagram elements: SUBJECT, verbal,
EXPERIMENTER, visual, STIMULUS.)


An  Interactive  Protocol  Analysis  System  for  Knowledge  Acquisition 


Omer Akin* and Marty Schultz
Department  of  Computer  Science 
Carnegie-Mellon  University 
Pittsburgh,  Pa.  15213 


An  experimental  paradigm  for  exploring  the  use  of  knowledge  in  understanding 
images  has  been  developed  by  Ohlander,  Reddy  and  Akin.  (1976)  The  experimental  tool 
used  yields  a  verbal  protocol  of  subject  behavior  during  the  "picture-puzzle"  task. 
This paper describes an interactive computer program that aids the transcription of the
protocols obtained.

In  the  picture-puzzle  task  subjects  are  required  to  determine  the  contents  of  a 
color  photograph  (Figure  1  in  the  first  paper  of  this  volume,  entitled  "Knowledge 
Acquisition")  without  ever  seeing  it  but  by  asking  questions  about  it  to  the 
experimenter.  The  experimenter  answers  all  questions  about  the  photograph  that  do 
not  involve  high-level  concepts  and  objects.  The  only  information  given  about  the 
photograph  is  low-level  information  like  shapes,  colors,  locations,  textures  of  different 
regions  in  the  photograph.  The  protocol  consists  of  all  conversation  that  takes  place 
between the subject and the experimenter. (Table 1 in the first paper of this volume,
entitled  "Knowledge  Acquisition") 

Protocol  analysis  has  been  used  by  Newell  and  Simon  (1972)  and  later  by  others 
(Eastman,  1970;  Baylor,  1971;  Farley,  1974)  to  analyze  similar  verbal  data.  Even 
though  Waterman  and  Newell  (1973)  have  developed  an  automated  protocol  analyzer, 
their  system  is  not  suitable  for  our  needs.  In  this  paper  we  present  a  framework  and 
an  interactive  computer  aid  for  the  analysis  of  protocols  obtained  from  the  "picture- 
puzzle"  task. 

The objective of the protocol analysis in this study is to identify the knowledge
sources used in the picture-puzzle task. The categories of knowledge sought are
three-fold: Feature Extraction Operators, Rewriting Rules, and Elements of Control Flow.
(Akin and Reddy, 1977) The categories identified with ease in the analysis are the
Feature Extraction Operators and the Rewriting Rules. The protocol analysis also
provides some insight into the kinds of Control Elements used in the task.

Feature Extraction Operators: Subjects doing the picture-puzzle task use a
variety of descriptive terms to identify those features of objects necessary for
recognition. These terms cover the categories: scene description, size, shape, color,
texture, location, quantity, representational, patterns and miscellaneous others.


*  Also  in  the  Department  of  Architecture. 


Rewriting Rules: Some production-like rules have been used by the subjects,
mostly implicitly, in order to translate the low-level scene descriptors into high-level
concepts or objects. Some examples are: "green indicates grass," "gray, linear, and
horizontal surfaces indicate roads."
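The production-like character of these rules can be sketched in a few lines of Python. The two rules are the examples quoted above. The representation (a set of required descriptors paired with a concept) and the function name rewrite are assumptions made for illustration, not a description of how subjects or any program in this study actually encode rules.

```python
# Each rule pairs a set of low-level descriptors with a high-level concept.
# The two rules are the examples quoted in the text; the data layout is assumed.
REWRITING_RULES = [
    (frozenset({"green"}), "grass"),
    (frozenset({"gray", "linear", "horizontal"}), "road"),
]


def rewrite(descriptors, rules=REWRITING_RULES):
    """Return every concept whose conditions are all present in descriptors."""
    observed = set(descriptors)
    return [concept for conditions, concept in rules if conditions <= observed]


print(rewrite(["gray", "linear", "horizontal"]))
print(rewrite(["green", "textured"]))
```

A rule fires only when all of its descriptors have been observed, so "gray" alone does not yet indicate a road.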

Elements  of  Control  Flow:  Subjects  generally  used  a  hypothesize  and  test 
strategy.  Other  specific  strategies  were  also  employed  to  generate  the  next 
hypothesis,  apply  the  next  test,  and  determine  the  next  issue  to  be  explored  in  special 
task  contexts. 

I. A FRAMEWORK FOR THE PICTURE-PUZZLE TASK

A primary objective in protocol analysis is to identify the problem states and the
operators that are used to move the current task state incrementally closer to a
solution state. The protocol analysis system used here tries to do the same. A set of
Task Operators has been defined a priori. Some of the Operators applied in the
picture-puzzle task are identified automatically, using prior knowledge about these
operators, and others are identified manually by the experimenter in an interactive
mode.

Three macro Task Operators have been consistently observed in all protocols.
These are: 1) Search: select an issue or aspect of the scene to explore; 2) Hypothesize:
generate a hypothesis about the identity of the issue(s) being explored; and 3) Test:
apply appropriate tests to clarify the hypotheses generated. All three kinds of
knowledge defined above are used in the Search, Hypothesize and Test Operators.

For example, one of the subjects uses the knowledge that "scenes can be
classified into two in general; outdoors and indoors" to select the first issue to deal
with. Then he generates a hypothesis (i.e., outdoors) based on the same knowledge.
Later he tests the converse hypothesis as well (i.e., indoors). In testing the "outdoor"
hypothesis he uses the Rewriting Rules that "outdoor scenes contain a part of the sky"
and "sky is blue." After both tests fail (i.e., neither outdoor nor indoor scene), the
subject goes back to the above Rewriting Rules and modifies them to read: "outdoor
scenes contain a part of the sky, unless the sky is completely occluded by other
objects," and "overcast skies are gray." This leads to the correct resolution of the
issue, i.e., the scene is an outdoor scene with occluded sky.
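The revision of the Rewriting Rules in this example can be sketched as a test that is first run with the original rules and then with the modified ones. This is a minimal illustration under stated assumptions: the function outdoor_test, its parameters, and the observations dictionary are hypothetical and are not taken from the protocol itself.

```python
def outdoor_test(observations, sky_colors, sky_may_be_occluded):
    """Return True if the "outdoor" hypothesis survives the test."""
    sky_seen = any(color in sky_colors for color in observations["colors"])
    if sky_seen:
        return True
    # Without visible sky, the hypothesis survives only under the revised rule.
    return sky_may_be_occluded and observations["top_is_occluded"]


# Hypothetical low-level answers gathered from the experimenter.
observations = {"colors": ["gray", "green", "brown"], "top_is_occluded": True}

# Original rules: outdoor scenes contain a part of the sky, and sky is blue.
first_try = outdoor_test(observations, {"blue"}, sky_may_be_occluded=False)

# Revised rules: the sky may be occluded, and overcast skies are gray.
second_try = outdoor_test(observations, {"blue", "gray"}, sky_may_be_occluded=True)

print(first_try, second_try)
```

The point of the sketch is that the hypothesis itself is unchanged between the two attempts; only the knowledge used to test it is modified.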

II.  TRANSCRIPTION  OF  PROTOCOLS  AND  ANALYSIS 

Identification  of  the  Feature  Extraction  Operators  requires  manual  search  of  the 
text  for  terms  describing  some  visual  aspects  of  the  scene  or  some  of  its  parts. 
Identification  of  the  Rewriting  Rules  requires  the  determination  of  what  new 
information  is  acquired  by  the  subject  in  each  state  and  what  Rewriting  Rules  are 
being  applied  to  translate  all  the  accumulated  information  into  an  assertion  about  the 
scene.  Finally,  in  order  to  identify  the  Elements  of  Control  Flow  a  transcription  of  the 
protocol  into  a  form  in  which  patterns  of  search  are  clearly  seen  is  needed.  The  most 
proper  format  for  achieving  this  is  the  Problem  Behavior  Graph  used  by  Newell  and 
Simon.  (1972) 


Each  protocol  consists  of  questions  asked  by  subjects,  answers  given  by  the 
experimenter,  and  comments  made  by  subjects  (both  conjectures  and  notes  made  by 
the  subjects  about  their  own  behavior).  The  task  of  the  protocol  transcription  system 
presented  here  is  to  take  this  information  and  aid  the  analysis  in  coming  up  with  the 
three  kinds  of  knowledge  used  in  each  Task  Operator:  Search,  Hypothesize  and  Test. 
A sample of a transcribed protocol is provided in Table 5 in the first paper of this
volume, entitled "Knowledge Acquisition." This sample corresponds to the first seven
question-answer sequences of the sample protocol.

The protocol analysis system (PROTDO) was developed* to simplify the manual
task of the human transcriber. PROTDO performs four major operations. First, it gets
the file of the protocol to be transcribed. Next, it displays each question-answer
sequence along with the previous and the next question-answer sequences in the
protocol. Then, it allows the transcriber to enter all Task Operators and related
knowledge sources for each question-answer sequence being transcribed, individually,
into the transcribed file. While doing so, PROTDO stores each question-answer
sequence along with the knowledge entered for each Task Operator. Some knowledge
sources, such as Feature Extraction Operators, are built into the "memory" of PROTDO.
This enables PROTDO to identify some knowledge sources automatically. Finally,
PROTDO stores all this information in a new file before quitting the protocol being
worked on.


III.  HOW  TO  USE  PROTDO 

The first question a potential user should ask is "Do I really need to use
PROTDO?" PROTDO is a program especially tuned to the transcription of
protocols taken with the picture-puzzle task, with the objective of discovering the
knowledge sources outlined earlier. Transcriptions with a different intent and/or
protocols from other tasks are very likely to be unsuitable for PROTDO.

When PROTDO is run, it first asks the user whether help with the program is
needed. If the user replies "Y" (yes), a brief summary of program usage is printed. Next
the user is asked for the name of the file to be processed, followed by the name of the
file in which to store the transcribed protocol. Next, the number of the protocol to be
transcribed is requested (multiple protocols can be stored in a single file, each
delimited by a page mark).


PROTDO then asks for a file name under which to store the set of Rewriting Rules
(RR), and a file name for the collection of Elements of Control Flow (ECF). To avoid
creation of either file, the user presses the return key without typing a name at the
respective prompt.

Now PROTDO can start to process the protocol selected. First, PROTDO displays
the previous question-answer sequence just processed along with the present
sequence on the CRT. In this fashion PROTDO displays all question-answer sequences
in pairs until the end of the protocol.

*  PROTDO  has  been  programmed  by  Marty  Schultz. 




The processing of each question-answer sequence displayed consists of entering
all knowledge sources used for each of the Task Operators in that sequence. That is,
for each of the three Task Operators (search, hypothesize and test) all three types of
knowledge sources are sought, i.e., Feature Extraction Operators, Rewriting Rules and
Elements of Control Flow. The previous question-answer sequence is displayed with
each current question-answer sequence to enable the user to see the context of the
current sequence.

After  PROTDO  has  displayed  the  appropriate  sequence  of  questions,  the  user  can 
enter  one  of  three  commands.  A  slash  "/"  instructs  the  program  to  terminate 
interactive  analysis,  and  finish  writing  the  files  using  only  that  which  has  already  been 
processed. A star "*" causes PROTDO to ignore this sequence and go on to the next
one.  Any  other  character  begins  interactive  analysis  of  the  present  sequence. 
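The three-way command choice can be summarized as a small dispatch routine. The following Python sketch is a reconstruction from the description above, not PROTDO's actual code; the function names classify_command and run_session are hypothetical, and the star command is taken to be the character "*".

```python
def classify_command(key):
    """Map the character typed by the user to one of the three actions."""
    if key == "/":
        return "terminate"  # stop interactive analysis and finish writing files
    if key == "*":
        return "skip"       # ignore this sequence and go on to the next one
    return "analyze"        # any other character begins interactive analysis


def run_session(sequences, keys):
    """Return the sequences selected for analysis, given one key per prompt."""
    analyzed = []
    for sequence, key in zip(sequences, keys):
        action = classify_command(key)
        if action == "terminate":
            break
        if action == "analyze":
            analyzed.append(sequence)
    return analyzed


# Hypothetical session: analyze q1, skip q2, analyze q3, terminate at q4.
print(run_session(["q1", "q2", "q3", "q4"], ["y", "*", "x", "/"]))
```

Note that a slash ends the session before the current sequence is processed, so only what has already been analyzed is kept.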

The first thing PROTDO does after encountering a character other than "/" or "*"
is to display the keyword "SEARCH" as the first category of Task Operators. At
this point the user has to decide what issue is being dealt with in the current question.
The user then types in that issue and returns control to PROTDO.
This causes PROTDO to save the entry as the description of the search Operator of
the current question.

The other Task Operator categories, hypothesize and test, are processed
similarly. That is, a keyword is prompted and the user enters a hypothesis or test
description. For the test Operator, PROTDO automatically proposes the text of the
question asked by the subject as the description. The user can accept this description
by typing "Y" for yes, or anything else for no. If it is rejected, PROTDO will expect the
user to type in a test description, just as in the previous two categories of
Task Operators.

Right after successfully entering any of the three Task Operator descriptions,
PROTDO enables the user to enter descriptions of the three classes of knowledge
sources: Feature Extraction Operators (FEO), Rewriting Rules (RR) and Elements of
Control Flow (ECF). PROTDO first displays the appropriate keyword for each
knowledge source category, i.e., FEO, RR, ECF. For each keyword PROTDO expects the
user to either accept the description it provides automatically or to enter a new
description.

PROTDO has a memory consisting of all FEOs it has ever encountered. Every
time a new FEO is entered in a transcription file, PROTDO saves it in its memory for
future transcriptions. Hence, whenever the FEO category comes up during a
transcription session, PROTDO finds words in the Task Operator description that match
FEOs in its memory and displays those on the CRT along with the keyword "FEO." The
user can then edit the Operators PROTDO has chosen.
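The memory lookup described here amounts to matching a stored vocabulary against the words of a description. The Python sketch below illustrates the idea under stated assumptions: the function names propose_feos and remember, the use of a set as the memory, and the whole-word, case-insensitive matching are choices made for this example and are not known details of PROTDO.

```python
import re


def propose_feos(description, memory):
    """Return remembered FEOs that occur as whole words in the description,
    ordered by where they first appear."""
    text = description.lower()
    hits = []
    for feo in memory:
        match = re.search(r"\b" + re.escape(feo.lower()) + r"\b", text)
        if match:
            hits.append((match.start(), feo))
    return [feo for _, feo in sorted(hits)]


def remember(new_feos, memory):
    """Add newly entered FEOs to the memory for future transcriptions."""
    memory.update(feo.lower() for feo in new_feos)


# Hypothetical memory contents and Task Operator description.
memory = {"color", "shape", "texture"}
print(propose_feos("What is the color and shape of the top region?", memory))
remember(["Size"], memory)
print(propose_feos("What size is it?", memory))
```

Because the memory grows with every session, later transcriptions receive more automatic proposals than earlier ones.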

As each FEO is printed, the user can accept it by typing a comma or a period.
A comma causes the FEO to be followed by a comma in the final transcription. A
period causes a blank to be used as the separator. This latter choice is used in
multiple word FEOs, such as "with respect to." Any other character typed will reject
that FEO for this sequence. After all operators have been generated, PROTDO will ask
if any others (not in its dictionary) are to be included. If the user wishes to enter
more, he types them here, each delimited by a blank or a comma. Otherwise the user
hits the return key. This will commence the entry of the FEO description. The FEOs
added here are subsequently merged into PROTDO's dictionary upon program exit.
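The comma and period conventions can be illustrated with a short routine that assembles the FEO field from the proposed words and the user's keystrokes. This is a sketch of one plausible reading of the description above; the function name assemble_feos and the removal of a trailing separator are assumptions.

```python
def assemble_feos(proposals, keys):
    """Build the FEO field from proposed words and one keystroke per word."""
    field = ""
    for word, key in zip(proposals, keys):
        if key == ",":
            field += word + ","   # accepted, separated by a comma
        elif key == ".":
            field += word + " "   # accepted, separated by a blank
        # any other character rejects the word for this sequence
    return field.rstrip(", ")


# Hypothetical proposals: "shape" is rejected, "with respect to" is kept whole.
proposals = ["color", "with", "respect", "to", "shape"]
keys = [",", ".", ".", ",", "x"]
print(assemble_feos(proposals, keys))
```

Accepting "with" and "respect" with periods and "to" with a comma keeps the three words together as a single multiple word FEO.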

By the time all three Task Operators are processed the information entered on
the  CRT  will  have  been  stored  in  the  transcription  file  along  with  the  text  of  the 
current  question-answer  sequence.  After  the  completion  of  the  last  question-answer 
sequence  the  transcription  file  will  be  closed.  As  was  mentioned  before,  a  slash  can 
be  used  to  terminate  transcription  before  starting  the  processing  of  the  current 
question-answer  sequence.  This  will  cause  PROTDO  to  save  the  total  transcription 
completed  up  to  the  current  question-answer  sequence  in  the  transcription  file.  The 
Rewriting  Rule  and  Elements  of  Control  Flow  files  will  also  be  saved,  if  they  were 
declared  at  the  onset. 


IV.  CONCLUSIONS 

PROTDO is useful for a special kind of transcription, i.e., looking for knowledge
sources in the picture-puzzle task. Consequently its usefulness in the general sense
is limited. However, it provides us with a rich catalogue of the knowledge used in this
specific area of research.

Furthermore, the output of PROTDO can be easily translated into the Problem
Behavior Graph format. This is necessary for observing the general patterns of
Control Flow. Each task operation included in the transcription represents a
modification in the problem state. Hence these are represented as right arrows linking
nodes (problem states) in Figure 1. Every time a question-answer sequence does not
alter the problem state, that is, the task operation is the same as the previous one, a
down arrow is used to indicate no advance in the problem state. The links starting
from earlier nodes indicate backtracking, which corresponds to going back to an issue
dealt with earlier in the transcription. The Problem Behavior Graphs obtained from
different tasks are expected to yield a more parsimonious understanding of the Elements
of Control Flow.
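The three kinds of links in the Problem Behavior Graph can be derived mechanically from the sequence of issues recorded in a transcription. The Python sketch below is an illustration under stated assumptions: it takes one issue label per question-answer sequence, and the function name classify_moves and the labels "right", "down" and "backtrack" are introduced here for convenience.

```python
def classify_moves(issues):
    """Label each question-answer sequence by the link it adds to the graph."""
    moves = []
    seen = []
    for issue in issues:
        if seen and issue == seen[-1]:
            moves.append("down")       # same operation: no advance in state
        elif issue in seen:
            moves.append("backtrack")  # link starts from an earlier node
        else:
            moves.append("right")      # new problem state
        seen.append(issue)
    return moves


# Hypothetical issue labels, one per question-answer sequence.
issues = ["scene type", "scene type", "sky", "ground", "sky"]
print(classify_moves(issues))
```

Counting the proportion of "down" and "backtrack" links per protocol would give one simple summary of the search pattern, although the paper does not report such a measure.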




REFERENCES 


1. Akin, O. and Reddy, R., Knowledge Acquisition for Image Understanding Research,
Journal of Computer Graphics and Image Processing, 1977 (in print).

2. Newell, A. and Simon, H. A., Human Problem Solving, Prentice Hall, New York, 1972.

3. Ohlander, R. B.; Reddy, R. and Akin, O., An Experimental System for Knowledge
Acquisition in Image Understanding Research, Computer Science Department
Report, Carnegie-Mellon University, 1976.

4. Waterman, D. A. and Newell, A., PAS-II: An interactive task-free version of an
automatic protocol analysis system, Third International Joint Conference on
Artificial Intelligence, 1973.


FIGURE 1. PROBLEM BEHAVIOR GRAPH OF TRANSCRIPTION.


Eye  Fixations  in  Image  Understanding  Research 


Omer  Akin* 

Department  of  Computer  Science 
Carnegie-Mellon  University 
Pittsburgh,  Pennsylvania  15213 


This  study  explores  an  alternative  experimental  tool  for  discovering  knowledge 
used  in  understanding  visual  scenes.  This  issue  has  been  examined  earlier  by  Akin  and 
Reddy  (1977)  using  verbal  protocols.  The  possibilities  of  using  visual  protocols,  i.e., 
eye  fixations,  in  achieving  the  same  ends  will  be  explored  in  this  paper. 

I. EYE FIXATIONS AS MEASUREMENT IN VISUAL INFORMATION PROCESSING

A frequently asked question in research dealing with visual perception of
complex scenes is simply, "How do we perceive pictures?" More specifically, this
question has taken the forms:


"..how  information  from  a  visual  scene  is  encoded?"  (Loftus,  1974) 

"What  does  a  person  do  when  he  looks  at  a  picture?"  (Buswell,  1935) 

"..[do]  key  regions  exist  within  pictorial  displays.,  [and  are]  some  stimuli 
more  important  than  others  within  the  displays?"  (Mackworth  and  Morandi, 
1976) 

Alternative  experimental  means  have  been  used  to  uncover  the  visual 
understanding  process.  Use  of  eye  fixations  in  image  understanding  has  been  an 
important  research  tool.  Below  we  shall  review  a  representative  sample  of  major 
studies  done  in  the  area  of  visual  perception  using  eye  fixations. 

One  of  the  earliest  and  most  extensive  eye  fixation  studies  was  undertaken  by 
Buswell.  (1935)  His  experiments  consist  of  measuring  eye  fixations  of  subjects 
observing  various  stimuli  under  different  task  conditions.  The  main  emphasis  of  the 
experiment  is  the  interpretation  of  eye  fixation  patterns. 

More  recently,  Loftus  (1974)  has  dealt  with  the  issue  of  recognition.  He  has 
recorded  eye  fixations  and  recognition  responses  of  subjects  perceiving  complex 
scenes.  He  has  also  altered  the  representation  and  contents  of  stimuli  to  control 
information  transmission. 

Mackworth  and  Morandi  (1976)  looked  at  fixations  and  the  judgment  of 
"recognizability"  of  subjects  with  two  complex  stimuli.  They  have  analyzed  the  data  by 
subdividing  the  stimuli  into  64  equal  parts. 
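Subdividing a stimulus into equal parts and tallying fixations per part can be sketched as follows. The text says only that 64 equal parts were used; the 8 by 8 arrangement, the pixel coordinate convention, and the function name grid_counts are assumptions made for this example.

```python
def grid_counts(fixations, width, height, rows=8, cols=8):
    """Count fixations in each cell of a rows x cols grid over the picture.

    Fixations are (x, y) points with 0 <= x <= width and 0 <= y <= height.
    """
    counts = [[0] * cols for _ in range(rows)]
    for x, y in fixations:
        # Points on the far edge are assigned to the last row or column.
        col = min(int(x / width * cols), cols - 1)
        row = min(int(y / height * rows), rows - 1)
        counts[row][col] += 1
    return counts


# Hypothetical fixation coordinates on an 800 x 600 picture.
fixations = [(0, 0), (10, 10), (799, 599), (800, 600)]
counts = grid_counts(fixations, width=800, height=600)
print(counts[0][0], counts[7][7])
```

With such a table, the number of fixations falling in each part can be set against the subjects' judgments for the same part.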

*  Also  in  the  Department  of  Architecture. 



All three studies essentially explore the processes responsible for
"understanding" and/or "recognition" of pictures. Yet, all have used different means of
pursuing this goal. Here I shall report on the characteristics of these alternative
experimental means and what each yields in terms of knowledge in the area.

There are basically four major experimental means used in these studies. The
measurement of eye fixations seems to be the common denominator of all. (Loftus, 1974;
Buswell, 1935; Mackworth and Morandi, 1976) A second experimental measure used is
recognition of a previously seen image. (Loftus, 1974) The third paradigm is the use of
subjective ranking of some qualitative aspect of the stimuli by the subjects.
(Mackworth and Morandi, 1976) The fourth experimental means used is the
decomposition of stimuli into smaller, or less comprehensive, parts. (Mackworth and
Morandi, 1976) Below we shall discuss the role of eye fixations in relation to the other
experimental tools.

Of course, the central issue in the use of eye fixation data is just what the
fixation corresponds to in terms of cognitive processes. Buswell states the common
explanation of the issue in the following terms:

"...the center of fixation of the eyes is the center of attention at a given time...
The evidence [provided by fixations] in regard to perceptual patterns is entirely
objective, but it furnishes no indication, except by inference, as to what the nature of
the subject's inner response to the picture may be." (Buswell, 1935)

Buswell's main concern stems from the large variance in fixation durations, i.e.,
3-40 thirtieths of a second. He attempts to explain this variance as a function of
stimulus characteristics and stages of the perception process. On the other hand, this
merely inferential evidence is rather significant. Loftus has suggested that even though
the fixation durations in a recognition task vary considerably, the subject's
performance is a function of the number of fixations rather than the duration of
fixations. This implies that the amount of information acquired during a fixation is more
or less constant. Therefore the variance in the duration of fixations results from
processes other than information gathering that take place during a fixation, such
as deciding what part of the picture to process next.

Loftus  has  also  shown  that  by  motivating  the  subjects  to  perform  better  it  is 
possible  to  reduce  average  fixation  durations  without  affecting  recognition 
performance.  This  indicates  that  some  extraneous  processes  or  simply  idle  time  may  be 
responsible  for  this  variance. 

The single study which has explored eye fixations most extensively and
exclusively is Buswell's "How People Look at Pictures." Location, duration and
sequence of fixations have been looked at under various stimulus, subject and task
conditions. From the fixation data he has inferred different picture processing stages
as a function of the time dimension and the task description.

He found initial fixations to be always shorter than successive ones. This is
attributed to the use of central cognitive processing in addition to simple visual
processing, as the "understanding" of a picture becomes more detailed and/or more
semantic. The evidence provided by Mackworth et al. and others (1976; Potter and
Levy, 1969; Pollack and Spence, 1968) indicates that the first fixations serve a
different purpose from the later ones, namely that of finding out the "gist" of a picture.
Loftus has also analyzed individual fixations to discover the underlying
internal perceptual processes. He concludes that, in terms of information gathering, a
fixation performs a standard function independent of its duration beyond the first 100
ms.


Hence there seem to be two major functions of a fixation: the first 100 ms. or
so constitutes the information gathering, and the remainder of the fixation duration
is spent deriving, from the knowledge about the picture, a next target location to
fixate upon. (Loftus, 1974) If we assume that the information about the picture is
internally represented in a structure isomorphic to a hierarchic structure (i.e., more
processing time is required for processing more detailed parts of the picture), then
it is plausible that subjects involved in detailed analysis in the later stages of
processing have longer fixation durations.
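
As a rough illustration of the two-function view above, the following Python sketch splits a fixation duration into an information-gathering part and a remainder. The 100 ms constant is the approximate figure quoted above; the function name `split_fixation` and the sample durations are hypothetical and are not data from the experiments.

```python
# Approximate information-gathering time per fixation ("the first 100 ms. or so").
INFO_GATHERING_MS = 100


def split_fixation(duration_ms, info_ms=INFO_GATHERING_MS):
    """Split a fixation duration into (information gathering, remainder).

    The remainder is the time assumed to be spent on other processes, such as
    selecting the next location to fixate upon.
    """
    gathering = min(duration_ms, info_ms)
    remainder = max(duration_ms - info_ms, 0)
    return gathering, remainder


# Hypothetical durations in ms, chosen only to show the arithmetic.
for duration in (80, 250, 400):
    print(duration, split_fixation(duration))
```

Under this reading, longer fixations differ only in the remainder term, which is where the variance in duration would be expected to appear.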

Buswell  found  that  different  task  situations,  such  as  simple  perception,  scanning 
for  target  recognition  or  subjective  judgment  of  picture  quality  tasks,  produced 
different  fixation  patterns.  This  indicates  that  the  information  provided  by  fixation 
behavior in visual tasks is extremely rich. However, there is little theoretical basis
for explaining the underlying processes responsible for these differences.

II. EXPERIMENT

A basic problem in all eye fixation studies of picture understanding is the lack
of a general theory of the picture understanding process. Recognizing this
fact, we have done some eye fixation studies using the same images analyzed in the
paper entitled "Knowledge Acquisition in Image Understanding Research" and using the
framework developed in the same study. (Akin and Reddy, 1977) Based on the findings
of the studies reviewed above, we have analyzed the pattern of fixations rather than
latencies to infer the search behavior exhibited. The results are inconclusive and have
led to more questions than they have answered. However, we present some of the
preliminary findings to expose the state of our research to other interested parties.

The eye fixation experiment consisted of instructing subjects to examine a
certain feature in a map, namely the intersection of two major traffic arteries in
downtown Pittsburgh. (Figure 3 in the second paper in this volume, entitled "An
Experimental System") Later, subjects were instructed to find that particular landmark,
the intersection, in a photograph of the same area (Figure 2 in the first paper of this
volume, entitled "Knowledge Acquisition"). The protocol of the visual search behavior
of the subjects was taken by recording their eye fixations. An image of the
photograph and the fixations were superimposed on videotape during the experiments.
Two subjects were used in this task. Samples from the protocols of these two
subjects are contained in Figures 1 and 2. The consecutive numbers in these figures
indicate the sequence of the fixations in each experiment. Note that the numbers also
indicate the location of the center of each fixation, which was about 1/2" in diameter.




III.  ANALYSIS 


The patterns obtained in the eye fixation protocols are compared against the
issues explored in the protocols of the picture-puzzle experiment. In the
picture-puzzle task, subjects are instructed to find the same traffic intersection in
the photograph of the downtown area after examining the map. In this case, however,
the subjects are not allowed to examine the photograph visually. They are given verbal
information about the photograph by the experimenter when they ask for it. This
experiment is described in detail in the paper entitled "Knowledge Acquisition in Image
Understanding Research." (Ohlander, Reddy and Akin, 1976)

First, it should be emphasised that the processes underlying the two experiments
are radically different. In the case of the eye fixation experiment, the subjects analyze
"meaningful" parts of what is visually available in each photograph. While the exact
nature of the underlying processes which derive the fixations is still a mystery, the
general consensus is that fixations represent those parts of the scene which are
directly informative for each respective processing stage encountered during the
interpretation of the visual image.

On  the  other  hand,  the  subjects  searching  for  a  target  in  a  photograph  in  the 
picture-puzzle task seem to construct internal representations of stimuli based on the
verbal  feedback  obtained  from  the  experimenter.  Subsequent  search  of  the  scene  is 
based  on  this  partial,  and  at  times  errorful,  representation  of  the  scene.  The 
construction  of  the  internal  representation  is  therefore  radically  different  from  the 
case  where  the  search  is  based  on  a  complete  visual  scene,  as  in  the  eye  fixation 
experiments. 

The initial information explored in the case of the picture-puzzle task about an
object, such as a building, usually pertains to a simple descriptive property, e.g.,
trapezoidal outline(s). An eye fixation on the same object (the building), in contrast,
readily extracts information (possibly in parallel) about many aspects of that object,
e.g., shape, texture, orientation, occlusions, shadows, the environment, etc.

Despite  these  differences  it  is  possible  to  observe  some  parallelism  between 
these  two  processes.  Evidence  suggests  that  successive  questions  about  a  single 
entity in the picture-puzzle experiment extract information about many descriptive
aspects, e.g., shape, texture, orientation, etc. (Akin and Reddy, 1977) This is similar to
the  case  of  the  eye  fixation  paradigm  with  the  exception  that  the  same  information 
may  be  obtained  in  parallel  in  the  latter  case. 

IV.  RESULTS 

In the discussion below, we shall compare the patterns of eye fixations against
the issues explored by successive sets of questions in the picture-puzzle experiment.
For example, the subject in the sample protocol from the picture-puzzle experiment
(Table 1 in the second paper of this volume, entitled "An Experimental System")
examines first, the river; second, the sky; third, the river; fourth, the buildings; fifth,
the roads; sixth, the buildings; seventh, greenery; and eighth, the road and the
intersection. These actions are respectively numbered in the protocol in the table.




This reflects a characteristic pattern where the subject starts from a familiar
object (something he can identify readily, such as the sky, the river, etc.) in the map
and then scans all objects that are expected to lie in the path joining the point of
departure to the target object (the intersection). Similar patterns are seen in the
fixation data, where sets of successive fixations land on the same characteristic objects.
For example, consider Figure 1. The first few fixations of Subject 1 (1-6) land around
the initial fixation (0) in the center of the scene. Then they successively fall on the
river (7-8), the buildings (9-11), the greenery and the roads (12-13), the buildings
(14-21), one of the target roads (22-24) and finally the intersection (24-25). The rest
of the protocol consists of fixations that appear to repeat this pattern.
This may be attributed to the subject wanting to verify his initial
findings by repeating his earlier perceptual actions.

The striking similarity in the sequence of the parts of the scene looked at in
each experiment is typical. This does not mean that we should get the same
results every time, as is obvious if we consider the degrees of freedom there are in
finding a path between the target and a randomly selected point of departure of
search. However, the results obtained here lead us to believe that the kinds of
control exercised in the two experiments examined here are very similar.

This result is intuitively correct. A next fixation is possibly made to add to the
current knowledge of the system about the scene, and is driven by the goal of finding
the target in the photograph. In the picture-puzzle task, each "next" question
serves the same purpose. Hence, with proper aggregation of fixations and
questions, it should be expected that similar patterns of control can be observed in
both experiments.
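
One way to make the aggregation mentioned above concrete is sketched below in Python. Consecutive fixations on the same object are merged into one visit, and the resulting object sequence is compared with the sequence of objects examined in the picture-puzzle protocol by the length of their longest common subsequence. The object labels are hand-coded from the descriptions in the text. The assignment of fixation 12 to the greenery, 13 to the roads and 24 to the target road, as well as the choice of longest common subsequence as a similarity measure, are assumptions made for illustration and are not part of the original analysis.

```python
from itertools import groupby


def collapse_runs(labels):
    """Merge consecutive repeats so that each visit to an object counts once."""
    return [label for label, _ in groupby(labels)]


def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


# (first fixation, last fixation, object), hand-coded from the Subject 1 text.
# Splitting 12-13 into greenery/roads and giving 24 to the road are assumptions.
fixation_ranges = [
    (0, 6, "center"), (7, 8, "river"), (9, 11, "buildings"),
    (12, 12, "greenery"), (13, 13, "roads"), (14, 21, "buildings"),
    (22, 24, "roads"), (25, 25, "intersection"),
]
per_fixation = [obj for first, last, obj in fixation_ranges
                for _ in range(first, last + 1)]
fixation_visits = collapse_runs(per_fixation)

# Objects examined in the picture-puzzle protocol, in the order given in the text.
question_visits = ["river", "sky", "river", "buildings", "roads",
                   "buildings", "greenery", "roads", "intersection"]

shared = lcs_length(fixation_visits, question_visits)
print(fixation_visits)
print(f"{shared} of {len(fixation_visits)} fixation visits follow the question order")
```

A measure of this kind only compares the order in which objects are visited; it ignores fixation durations and the number of questions asked about each object.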


V.  CONCLUSIONS 

The  eye  fixation  data  indicates  one  major  result.  The  picture-puzzle  paradigm 
used  in  the  experiment  reported  earlier  is  an  experimental  tool  for  accurately 
simulating  the  actual  visual  understanding  process.  This  on  the  one  hand  supports  our 
experimental  assumptions  and  on  the  other  hand  provides  a  more  direct  means  for 
exploring  the  issue  of  Control  Flow  in  visual  understanding. 

Ideally,  what  needs  to  be  done  in  the  eye  fixation  experiment  is  to  enable  the 
subjects to observe the map and the photograph simultaneously, while the protocol of
eye fixations is taken. By recording the patterns of fixations for both stimuli it will
be  possible  to  infer  more  directly  the  information  obtained  from  the  map  that  directs 
the  flow  of  eye  fixations  towards  the  target  in  the  photograph. 

VI.  ACKNOWLEDGEMENTS 

We are indebted to the Department of Psychology for allowing us to make use of
their  eye  fixation  recording  equipment.  We  also  thank  Patricia  Carpenter  and  Adrian 
Baer  for  the  valuable  protocols  they  have  provided. 




REFERENCES 

1.  Akin, O. and Reddy, R., Knowledge Acquisition in Image Understanding Research,
    Journal of Computer Graphics and Image Processing, 1977 (in print).

2.  Buswell, G. T., How People Look at Pictures, University of Chicago Press,
    Chicago, 1935.

3.  Loftus, G. R., A framework for a theory of picture recognition, presented at the
    National Academy of Sciences Specialist's Meeting on Eye Movements and
    Psychological Processes, Princeton, N.J., Apr. 1974.

4.  Mackworth, N. H. and Morandi, A. J., The gaze selects informative details within
    pictures, Perception and Psychophysics, 1976, 2, 547-551.

5.  Ohlander, R., Reddy, R., and Akin, O., An Experimental System for Knowledge
    Acquisition in Image Understanding Research, Computer Science Department
    Reports, Carnegie-Mellon University, 1976.

6.  Pollack, I. and Spence, D., Subjective pictorial information and visual search,
    Perception and Psychophysics, 1968, 3, 41-44.

7.  Potter, M. C. and Levy, E. I., Recognition memory for a rapid sequence of pictures,
    Journal of Experimental Psychology, 1969, 81, 10-15.


FIGURE 2. EYE FIXATION PROTOCOL OF SUBJECT 2 (continued).


